Handle URI:
http://hdl.handle.net/10754/598290
Title:
Exploration of automatic optimisation for CUDA programming
Authors:
Al-Mouhamed, Mayez; Khan, Ayaz ul Hassan
Abstract:
© 2014 Taylor & Francis. Writing optimised compute unified device architecture (CUDA) programs for graphics processing units (GPUs) is complex even for experts. We present a design methodology for a restructuring tool that converts C-loops into optimised CUDA kernels using a three-step algorithm: loop tiling, coalesced memory access and resource optimisation. A method for finding possible loop tiling solutions with coalesced memory access is developed, and a simplified algorithm for restructuring C-loops into an efficient CUDA kernel is presented. In the evaluation, we implement matrix multiply (MM), matrix transpose (M-transpose), matrix scaling (M-scaling) and matrix vector multiply (MV) using the proposed algorithm. We analyse the execution time and GPU throughput for these applications, which compare favourably with other proposals. Evaluation is carried out while scaling the problem size and running under a variety of kernel configurations. The obtained speedup is about 28-35% for M-transpose compared to the NVIDIA Software Development Kit, 33% for MV compared to the general-purpose computation on graphics processing units (GPGPU) compiler, and more than 80% for MM and M-scaling compared to CUDA-lite.
Citation:
Al-Mouhamed M, Khan A ul H (2014) Exploration of automatic optimisation for CUDA programming. International Journal of Parallel, Emergent and Distributed Systems 30: 309–324. Available: http://dx.doi.org/10.1080/17445760.2014.953158.
Publisher:
Informa UK Limited
Journal:
International Journal of Parallel, Emergent and Distributed Systems
Issue Date:
16-Sep-2014
DOI:
10.1080/17445760.2014.953158
Type:
Article
ISSN:
1744-5760; 1744-5779
Sponsors:
The authors would like to acknowledge the support provided by King Abdulaziz City for Science and Technology (KACST) through the Science & Technology Unit at King Fahd University of Petroleum & Minerals (KFUPM) for funding this work through project No. 12-INF3008-04 as part of the National Science, Technology and Innovation Plan. Thanks to the Department of Information and Computer Science (ICS), King Fahd University of Petroleum and Minerals (KFUPM), and King Abdullah University of Science and Technology (KAUST) for giving access to their computing facilities.
Appears in Collections:
Publications Acknowledging KAUST Support

Full metadata record

DC Field | Value | Language
dc.contributor.author | Al-Mouhamed, Mayez | en
dc.contributor.author | Khan, Ayaz ul Hassan | en
dc.date.accessioned | 2016-02-25T13:18:04Z | en
dc.date.available | 2016-02-25T13:18:04Z | en
dc.date.issued | 2014-09-16 | en
dc.identifier.citation | Al-Mouhamed M, Khan A ul H (2014) Exploration of automatic optimisation for CUDA programming. International Journal of Parallel, Emergent and Distributed Systems 30: 309–324. Available: http://dx.doi.org/10.1080/17445760.2014.953158. | en
dc.identifier.issn | 1744-5760 | en
dc.identifier.issn | 1744-5779 | en
dc.identifier.doi | 10.1080/17445760.2014.953158 | en
dc.identifier.uri | http://hdl.handle.net/10754/598290 | en
dc.description.abstract | © 2014 Taylor & Francis. Writing optimised compute unified device architecture (CUDA) programs for graphics processing units (GPUs) is complex even for experts. We present a design methodology for a restructuring tool that converts C-loops into optimised CUDA kernels using a three-step algorithm: loop tiling, coalesced memory access and resource optimisation. A method for finding possible loop tiling solutions with coalesced memory access is developed, and a simplified algorithm for restructuring C-loops into an efficient CUDA kernel is presented. In the evaluation, we implement matrix multiply (MM), matrix transpose (M-transpose), matrix scaling (M-scaling) and matrix vector multiply (MV) using the proposed algorithm. We analyse the execution time and GPU throughput for these applications, which compare favourably with other proposals. Evaluation is carried out while scaling the problem size and running under a variety of kernel configurations. The obtained speedup is about 28-35% for M-transpose compared to the NVIDIA Software Development Kit, 33% for MV compared to the general-purpose computation on graphics processing units (GPGPU) compiler, and more than 80% for MM and M-scaling compared to CUDA-lite. | en
dc.description.sponsorship | The authors would like to acknowledge the support provided by King Abdulaziz City for Science and Technology (KACST) through the Science & Technology Unit at King Fahd University of Petroleum & Minerals (KFUPM) for funding this work through project No. 12-INF3008-04 as part of the National Science, Technology and Innovation Plan. Thanks to the Department of Information and Computer Science (ICS), King Fahd University of Petroleum and Minerals (KFUPM), and King Abdullah University of Science and Technology (KAUST) for giving access to their computing facilities. | en
dc.publisher | Informa UK Limited | en
dc.subject | compiler transformations | en
dc.subject | CUDA | en
dc.subject | directive-based language | en
dc.subject | GPGPU | en
dc.subject | GPU | en
dc.subject | parallel programming | en
dc.subject | source-to-source compiler | en
dc.title | Exploration of automatic optimisation for CUDA programming | en
dc.type | Article | en
dc.identifier.journal | International Journal of Parallel, Emergent and Distributed Systems | en
dc.contributor.institution | King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia | en
All Items in KAUST are protected by copyright, with all rights reserved, unless otherwise indicated.