research-article

Leveraging non-blocking collective communication in high-performance applications

Authors: Torsten Hoefler, Peter Gottschling, Andrew Lumsdaine

SPAA '08: Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures

Pages 113 - 115

https://doi.org/10.1145/1378533.1378554

Published: 14 June 2008

Abstract

Although overlapping communication with computation is an important mechanism for achieving high performance in parallel programs, developing applications that actually achieve good overlap can be difficult. Existing approaches are typically based on manual or compiler-based transformations. This paper presents a pattern and library-based approach to optimizing collective communication in parallel high-performance applications, based on using non-blocking collective operations to enable overlapping of communication and computation. Common communication and computation patterns in iterative SPMD computations are used to motivate the transformations we present. Our approach provides the programmer with the capability to separately optimize communication and computation in an application, while automating the interaction between computation and communication to achieve maximum overlap. Performance results with a model application show more than a 90% decrease in communication overhead, resulting in 21% overall performance improvements.
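
To make the overlap idea concrete, here is a minimal sketch. It is neither the authors' library nor MPI: it simulates a non-blocking collective with a background thread, so that the communication for one chunk proceeds while the next chunk is being computed. All names (`fake_collective`, `compute`, `blocking_pipeline`, `overlapped_pipeline`, `timed`) and the sleep-based costs are assumptions made purely for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

COST = 0.03  # assumed cost (seconds) of one compute step and of one collective


def fake_collective(data):
    """Hypothetical stand-in for a collective (e.g. a reduction): sleep, then sum."""
    time.sleep(COST)
    return sum(data)


def compute(chunk):
    """Hypothetical local computation on one chunk: sleep, then square each value."""
    time.sleep(COST)
    return [x * x for x in chunk]


def blocking_pipeline(chunks):
    """Compute a chunk, then wait for its collective before touching the next chunk."""
    return [fake_collective(compute(chunk)) for chunk in chunks]


def overlapped_pipeline(chunks):
    """Start the collective for chunk i, compute chunk i+1 meanwhile, then wait."""
    results = []
    pending = None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for chunk in chunks:
            data = compute(chunk)  # runs while the previous collective is in flight
            if pending is not None:
                results.append(pending.result())  # "wait" on the previous operation
            pending = pool.submit(fake_collective, data)  # "start" non-blocking op
        if pending is not None:
            results.append(pending.result())  # drain the last outstanding operation
    return results


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    chunks = [[1, 2], [3, 4], [5, 6], [7, 8]]
    for fn in (blocking_pipeline, overlapped_pipeline):
        result, elapsed = timed(fn, chunks)
        print(f"{fn.__name__}: {result} in {elapsed:.3f}s")
```

In an MPI program, the submit/result pair would roughly correspond to starting a non-blocking collective and later waiting on its request (MPI-3 later standardized calls such as `MPI_Iallreduce` together with `MPI_Wait`). How much communication time can be hidden depends on how much independent computation is available between the start and the wait.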

Index Terms

1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages

Published In

SPAA '08: Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures

June 2008

380 pages

ISBN: 9781595939739

DOI: 10.1145/1378533

General Chair: Friedhelm Meyer auf der Heide, University of Paderborn, Germany

Program Chair: Nir Shavit, Tel-Aviv University, Israel, and Sun Labs, USA

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SPAA08: 20th ACM Symposium on Parallelism in Algorithms and Architectures

June 14 - 16, 2008

Munich, Germany

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%
