Handle URI:
http://hdl.handle.net/10754/599106
Title:
Optimizing Strassen matrix multiply on GPUs
Authors:
ul Hasan Khan, Ayaz; Al-Mouhamed, Mayez; Fatayer, Allam
Abstract:
© 2015 IEEE. Many-core systems are designed for applications with large data parallelism. Strassen Matrix Multiply (MM) can be formulated as a depth-first (DFS) traversal of a recursion tree in which all cores work in parallel on computing each of the NxN sub-matrices; this reduces storage at the cost of the large data motion needed to gather and aggregate the results. We propose Strassen and Winograd algorithms (S-MM and W-MM) based on three optimizations: a set of basic algebra functions to reduce overhead, invocation of an efficient library (CUBLAS 5.5), and parameter tuning of a parametric kernel to improve resource occupancy. On GPUs, W-MM and S-MM with one recursion level outperform the CUBLAS 5.5 library by up to a factor of two for large arrays satisfying N>=2048 and N>=3072, respectively. Compared to the NVIDIA SDK library, S-MM and W-MM achieved a speedup of between 20x and 80x for the above arrays. The proposed approach can be used to enhance the performance of the CUBLAS and MKL libraries.
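As an illustration of what "one recursion level" means in the abstract, the following is a minimal sketch of the textbook Strassen scheme in plain Python. It is not the authors' GPU implementation: the helper names (`mat_add`, `mat_sub`, `base_mm`, `split`, `strassen_one_level`) are hypothetical, and `base_mm` is a simple classical multiply standing in for the tuned library call (CUBLAS in the paper). The sketch assumes square matrices with an even dimension.

```python
# Illustrative sketch only: one level of the textbook Strassen recursion
# on plain Python lists. All function names here are hypothetical.

def mat_add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def mat_sub(x, y):
    """Element-wise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def base_mm(x, y):
    """Classical O(n^3) multiply; stands in for a tuned library GEMM."""
    cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]

def split(m):
    """Split an even-sized square matrix into four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])

def strassen_one_level(a, b):
    """Multiply a and b with 7 half-size products instead of 8."""
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)

    # The seven Strassen products; each is one half-size multiply.
    m1 = base_mm(mat_add(a11, a22), mat_add(b11, b22))
    m2 = base_mm(mat_add(a21, a22), b11)
    m3 = base_mm(a11, mat_sub(b12, b22))
    m4 = base_mm(a22, mat_sub(b21, b11))
    m5 = base_mm(mat_add(a11, a12), b22)
    m6 = base_mm(mat_sub(a21, a11), mat_add(b11, b12))
    m7 = base_mm(mat_sub(a12, a22), mat_add(b21, b22))

    # Combine the products into the four quadrants of the result.
    c11 = mat_add(mat_sub(mat_add(m1, m4), m5), m7)
    c12 = mat_add(m3, m5)
    c21 = mat_add(m2, m4)
    c22 = mat_add(mat_add(mat_sub(m1, m2), m3), m6)

    # Stitch the quadrants back into one matrix.
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom
```

The sketch trades one half-size multiplication for extra additions and subtractions, which is why the abstract stresses low-overhead basic algebra functions alongside the library multiply.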
Citation:
Ul Hasan Khan A, Al-Mouhamed M, Fatayer A (2015) Optimizing Strassen matrix multiply on GPUs. 2015 IEEE/ACIS 16th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD). Available: http://dx.doi.org/10.1109/SNPD.2015.7176172.
Publisher:
Institute of Electrical and Electronics Engineers (IEEE)
Journal:
2015 IEEE/ACIS 16th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD)
Issue Date:
Jun-2015
DOI:
10.1109/SNPD.2015.7176172
Type:
Conference Paper
Sponsors:
The authors would like to acknowledge the support provided by King Abdulaziz City for Science and Technology (KACST) through the Science & Technology Unit at King Fahd University of Petroleum & Minerals (KFUPM) for funding this work through project No. 12-INF3008-04 as part of the National Science, Technology and Innovation Plan. We are also very thankful to King Abdullah University of Science and Technology (KAUST) for providing access to their K20X GPU cluster to run the experiments.
Appears in Collections:
Publications Acknowledging KAUST Support

Full metadata record

DC Field | Value | Language
dc.contributor.author | ul Hasan Khan, Ayaz | en
dc.contributor.author | Al-Mouhamed, Mayez | en
dc.contributor.author | Fatayer, Allam | en
dc.date.accessioned | 2016-02-25T13:52:59Z | en
dc.date.available | 2016-02-25T13:52:59Z | en
dc.date.issued | 2015-06 | en
dc.identifier.citation | Ul Hasan Khan A, Al-Mouhamed M, Fatayer A (2015) Optimizing Strassen matrix multiply on GPUs. 2015 IEEE/ACIS 16th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD). Available: http://dx.doi.org/10.1109/SNPD.2015.7176172. | en
dc.identifier.doi | 10.1109/SNPD.2015.7176172 | en
dc.identifier.uri | http://hdl.handle.net/10754/599106 | en
dc.description.abstract | © 2015 IEEE. Many-core systems are designed for applications with large data parallelism. Strassen Matrix Multiply (MM) can be formulated as a depth-first (DFS) traversal of a recursion tree in which all cores work in parallel on computing each of the NxN sub-matrices; this reduces storage at the cost of the large data motion needed to gather and aggregate the results. We propose Strassen and Winograd algorithms (S-MM and W-MM) based on three optimizations: a set of basic algebra functions to reduce overhead, invocation of an efficient library (CUBLAS 5.5), and parameter tuning of a parametric kernel to improve resource occupancy. On GPUs, W-MM and S-MM with one recursion level outperform the CUBLAS 5.5 library by up to a factor of two for large arrays satisfying N>=2048 and N>=3072, respectively. Compared to the NVIDIA SDK library, S-MM and W-MM achieved a speedup of between 20x and 80x for the above arrays. The proposed approach can be used to enhance the performance of the CUBLAS and MKL libraries. | en
dc.description.sponsorship | The authors would like to acknowledge the support provided by King Abdulaziz City for Science and Technology (KACST) through the Science & Technology Unit at King Fahd University of Petroleum & Minerals (KFUPM) for funding this work through project No. 12-INF3008-04 as part of the National Science, Technology and Innovation Plan. We are also very thankful to King Abdullah University of Science and Technology (KAUST) for providing access to their K20X GPU cluster to run the experiments. | en
dc.publisher | Institute of Electrical and Electronics Engineers (IEEE) | en
dc.subject | CUDA Programming | en
dc.subject | Fast Matrix Multiplication | en
dc.subject | Graphics Processing Unit (GPU) | en
dc.subject | Strassen | en
dc.title | Optimizing Strassen matrix multiply on GPUs | en
dc.type | Conference Paper | en
dc.identifier.journal | 2015 IEEE/ACIS 16th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD) | en
dc.contributor.institution | King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia | en
All Items in KAUST are protected by copyright, with all rights reserved, unless otherwise indicated.