Auto-tuning Non-blocking Collective Communication Operations

Handle URI:
http://hdl.handle.net/10754/597636
Title:
Auto-tuning Non-blocking Collective Communication Operations
Authors:
Barigou, Youcef; Venkatesan, Vishwanath; Gabriel, Edgar
Abstract:
Collective operations are widely used in large-scale scientific applications and are critical to the scalability of these applications at large process counts. It has also been demonstrated that collective operations have to be carefully tuned for a given platform and application scenario to maximize their performance. Non-blocking collective operations extend the concept of collective operations by offering the additional benefit of overlapping communication and computation. This paper presents the automatic run-time tuning of non-blocking collective communication operations, which allows the communication library to choose the best-performing implementation for a non-blocking collective operation on a case-by-case basis. The paper demonstrates that libraries using a single algorithm or implementation for a non-blocking collective operation will inevitably lead to suboptimal performance in many scenarios, thus validating the necessity of run-time tuning for these operations. The benefits of the approach are further demonstrated for an application kernel using a multi-dimensional Fast Fourier Transform. The results obtained for the application scenario indicate a performance improvement of up to 40% compared to the current state of the art.
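The abstract describes a library that picks, at run time, the best-performing implementation of a non-blocking collective for each case. The Python sketch below shows one simple way such empirical selection can work; it is an assumption for illustration, not necessarily the mechanism used in the paper. The candidate functions `impl_a`, `impl_b` and `impl_c`, their cost models, the class `RuntimeTuner` and the learning-phase length `tests_per_impl` are all hypothetical. A real library would time actual communication (for example, from posting the operation to its completion) instead of returning a simulated cost.

```python
from typing import Callable, Dict, List, Optional


# Hypothetical candidate implementations of one non-blocking collective.
# They return an invented cost in seconds instead of communicating, so
# the selection logic can run without an MPI library.
def impl_a(msg_bytes: int) -> float:
    return 1e-6 * msg_bytes


def impl_b(msg_bytes: int) -> float:
    return 5e-4 + 2e-7 * msg_bytes


def impl_c(msg_bytes: int) -> float:
    return 2e-4 + 5e-7 * msg_bytes


class RuntimeTuner:
    """Empirical run-time selection (hypothetical sketch): try each
    candidate round-robin for a fixed number of calls, accumulate its
    cost, then use the cheapest candidate for all later calls."""

    def __init__(self, impls: Dict[str, Callable[[int], float]],
                 tests_per_impl: int = 4) -> None:
        self.impls = impls
        self.names: List[str] = list(impls)  # insertion order is kept
        self.tests_per_impl = tests_per_impl
        self.total = {name: 0.0 for name in self.names}
        self.calls = 0
        self.winner: Optional[str] = None  # set when learning ends

    def invoke(self, msg_bytes: int) -> str:
        """Run one instance of the collective; return the name of the
        implementation that was used for this call."""
        if self.winner is not None:
            self.impls[self.winner](msg_bytes)
            return self.winner
        # Learning phase: cycle through the candidates and sum costs.
        name = self.names[self.calls % len(self.names)]
        self.total[name] += self.impls[name](msg_bytes)
        self.calls += 1
        if self.calls == len(self.names) * self.tests_per_impl:
            self.winner = min(self.names, key=lambda n: self.total[n])
        return name


if __name__ == "__main__":
    candidates = {"impl_a": impl_a, "impl_b": impl_b, "impl_c": impl_c}
    for size in (64, 4096):
        tuner = RuntimeTuner(candidates)  # one tuner per use case
        for _ in range(20):
            tuner.invoke(size)
        print(size, tuner.winner)
```

With these invented cost models the small message size settles on `impl_a` and the larger one on `impl_b`, which mirrors the abstract's point that a single fixed algorithm cannot be best in every scenario. Keeping a separate tuner per use case is what makes the decision "case by case".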
Citation:
Barigou Y, Venkatesan V, Gabriel E (2015) Auto-tuning Non-blocking Collective Communication Operations. 2015 IEEE International Parallel and Distributed Processing Symposium Workshop. Available: http://dx.doi.org/10.1109/IPDPSW.2015.15.
Publisher:
Institute of Electrical and Electronics Engineers (IEEE)
Journal:
2015 IEEE International Parallel and Distributed Processing Symposium Workshop
Issue Date:
May-2015
DOI:
10.1109/IPDPSW.2015.15
Type:
Conference Paper
Sponsors:
Partial support for this work was provided by the National Science Foundation’s Computer Systems Research program under Award No. CNS-0846002 and CRI-0958464. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We would like to thank the KAUST Supercomputing Laboratory for giving us access to their IBM BlueGene/P.
Appears in Collections:
Publications Acknowledging KAUST Support

Full metadata record

DC Field | Value | Language
dc.contributor.author | Barigou, Youcef | en
dc.contributor.author | Venkatesan, Vishwanath | en
dc.contributor.author | Gabriel, Edgar | en
dc.date.accessioned | 2016-02-25T12:43:28Z | en
dc.date.available | 2016-02-25T12:43:28Z | en
dc.date.issued | 2015-05 | en
dc.identifier.doi | 10.1109/IPDPSW.2015.15 | en
dc.identifier.uri | http://hdl.handle.net/10754/597636 | en
dc.publisher | Institute of Electrical and Electronics Engineers (IEEE) | en
dc.title | Auto-tuning Non-blocking Collective Communication Operations | en
dc.type | Conference Paper | en
dc.identifier.journal | 2015 IEEE International Parallel and Distributed Processing Symposium Workshop | en
dc.contributor.institution | Department of Computer Science, University of Houston, Houston, TX 77204-3010, USA. | en
All Items in KAUST are protected by copyright, with all rights reserved, unless otherwise indicated.