
Improving concurrency and asynchrony in multithreaded MPI applications using software offloading

Published: 15 November 2015

Abstract

We present a new approach for multithreaded communication and asynchronous progress in MPI applications, wherein we offload communication processing to a dedicated thread. The central premise is that given the rapidly increasing core counts on modern systems, the improvements in MPI performance arising from dedicating a thread to drive communication outweigh the small loss of resources for application computation, particularly when overlap of communication and computation can be exploited. Our approach allows application threads to make MPI calls concurrently, enqueuing these as communication tasks to be processed by a dedicated communication thread. This not only guarantees progress for such communication operations, but also reduces load imbalance. Our implementation additionally significantly reduces the overhead of mutual exclusion seen in existing implementations for applications using MPI_THREAD_MULTIPLE. Our technique requires no modification to the application, and we demonstrate significant performance improvement (up to 2X) for QCD, 1-D FFT and deep learning CNN applications.
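
The following C sketch illustrates the general offload pattern described above: application threads enqueue send operations into a shared queue instead of issuing them directly, and a single dedicated thread drains the queue, posts the nonblocking sends, and drives progress. This is a minimal, application-level illustration only, not the paper's implementation; the task type, queue layout, and names such as `comm_task_t`, `enqueue_send`, and `comm_thread` are assumptions made for the example.

```c
/* Minimal sketch of the offload pattern (illustrative, not the paper's code).
 * Build (example): mpicc -pthread offload_sketch.c -o offload_sketch
 * Run   (example): mpirun -np 2 ./offload_sketch                         */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define MAX_TASKS 128

typedef struct {
    const void *buf;
    int count;
    int dest;
    int tag;
} comm_task_t;

static comm_task_t task_queue[MAX_TASKS];
static int head = 0, tail = 0, shutting_down = 0;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

/* Called by application threads instead of MPI_Isend. No overflow check;
 * a real implementation would use a bounded or lock-free queue. */
static void enqueue_send(const void *buf, int count, int dest, int tag)
{
    pthread_mutex_lock(&qlock);
    task_queue[tail % MAX_TASKS] = (comm_task_t){ buf, count, dest, tag };
    tail++;
    pthread_mutex_unlock(&qlock);
}

/* Dedicated communication thread: drains the queue, issues MPI_Isend, and
 * repeatedly tests outstanding requests so sends progress while the
 * application threads only compute. */
static void *comm_thread(void *arg)
{
    (void)arg;
    MPI_Request reqs[MAX_TASKS];
    int nreqs = 0;
    for (;;) {
        pthread_mutex_lock(&qlock);
        int stop = shutting_down && head == tail;
        while (head != tail && nreqs < MAX_TASKS) {
            comm_task_t t = task_queue[head % MAX_TASKS];
            head++;
            MPI_Isend(t.buf, t.count, MPI_CHAR, t.dest, t.tag,
                      MPI_COMM_WORLD, &reqs[nreqs++]);
        }
        pthread_mutex_unlock(&qlock);

        int all_done = 0;
        MPI_Testall(nreqs, reqs, &all_done, MPI_STATUSES_IGNORE);
        if (all_done)
            nreqs = 0;
        if (stop && nreqs == 0)
            break;
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    /* MPI_THREAD_MULTIPLE is requested only because the main thread also
     * receives in this toy example. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    pthread_t tid;
    pthread_create(&tid, NULL, comm_thread, NULL);

    /* Example: each rank sends one small message around a ring through the
     * offload queue and receives directly. */
    char msg[] = "hello";
    char buf[sizeof msg];
    if (nprocs > 1) {
        enqueue_send(msg, (int)sizeof msg, (rank + 1) % nprocs, 0);
        MPI_Recv(buf, (int)sizeof buf, MPI_CHAR, (rank - 1 + nprocs) % nprocs,
                 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d received: %s\n", rank, buf);
    }

    pthread_mutex_lock(&qlock);
    shutting_down = 1;
    pthread_mutex_unlock(&qlock);
    pthread_join(tid, NULL);

    MPI_Finalize();
    return 0;
}
```

In the approach described in the abstract, this tasking is handled inside the runtime rather than by the application, which is why no application changes are required and why contention under MPI_THREAD_MULTIPLE is reduced.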




Published In

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015
985 pages
ISBN:9781450337236
DOI:10.1145/2807591
  • General Chair: Jackie Kern
  • Program Chair: Jeffrey S. Vetter
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 November 2015

Qualifiers

  • Research-article

Conference

SC15

Acceptance Rates

SC '15 Paper Acceptance Rate: 79 of 358 submissions, 22%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%


