
Improving concurrency and asynchrony in multithreaded MPI applications using software offloading

Published: 15 November 2015

Abstract

We present a new approach for multithreaded communication and asynchronous progress in MPI applications, wherein we offload communication processing to a dedicated thread. The central premise is that given the rapidly increasing core counts on modern systems, the improvements in MPI performance arising from dedicating a thread to drive communication outweigh the small loss of resources for application computation, particularly when overlap of communication and computation can be exploited. Our approach allows application threads to make MPI calls concurrently, enqueuing these as communication tasks to be processed by a dedicated communication thread. This not only guarantees progress for such communication operations, but also reduces load imbalance. Our implementation additionally significantly reduces the overhead of mutual exclusion seen in existing implementations for applications using MPI_THREAD_MULTIPLE. Our technique requires no modification to the application, and we demonstrate significant performance improvement (up to 2X) for QCD, 1-D FFT and deep learning CNN applications.
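
The following C sketch illustrates the general offload pattern described above: application threads enqueue send operations into a shared queue instead of issuing them directly, and a single dedicated thread drains the queue, posts the nonblocking sends, and drives progress. This is a minimal, application-level illustration only, not the paper's implementation; the task type, queue layout, and names such as `comm_task_t`, `enqueue_send`, and `comm_thread` are assumptions made for the example.

```c
/* Minimal sketch of the offload pattern (illustrative, not the paper's code).
 * Build (example): mpicc -pthread offload_sketch.c -o offload_sketch
 * Run   (example): mpirun -np 2 ./offload_sketch                         */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define MAX_TASKS 128

typedef struct {
    const void *buf;
    int count;
    int dest;
    int tag;
} comm_task_t;

static comm_task_t task_queue[MAX_TASKS];
static int head = 0, tail = 0, shutting_down = 0;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

/* Called by application threads instead of MPI_Isend. No overflow check;
 * a real implementation would use a bounded or lock-free queue. */
static void enqueue_send(const void *buf, int count, int dest, int tag)
{
    pthread_mutex_lock(&qlock);
    task_queue[tail % MAX_TASKS] = (comm_task_t){ buf, count, dest, tag };
    tail++;
    pthread_mutex_unlock(&qlock);
}

/* Dedicated communication thread: drains the queue, issues MPI_Isend, and
 * repeatedly tests outstanding requests so sends progress while the
 * application threads only compute. */
static void *comm_thread(void *arg)
{
    (void)arg;
    MPI_Request reqs[MAX_TASKS];
    int nreqs = 0;
    for (;;) {
        pthread_mutex_lock(&qlock);
        int stop = shutting_down && head == tail;
        while (head != tail && nreqs < MAX_TASKS) {
            comm_task_t t = task_queue[head % MAX_TASKS];
            head++;
            MPI_Isend(t.buf, t.count, MPI_CHAR, t.dest, t.tag,
                      MPI_COMM_WORLD, &reqs[nreqs++]);
        }
        pthread_mutex_unlock(&qlock);

        int all_done = 0;
        MPI_Testall(nreqs, reqs, &all_done, MPI_STATUSES_IGNORE);
        if (all_done)
            nreqs = 0;
        if (stop && nreqs == 0)
            break;
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    /* MPI_THREAD_MULTIPLE is requested only because the main thread also
     * receives in this toy example. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    pthread_t tid;
    pthread_create(&tid, NULL, comm_thread, NULL);

    /* Example: each rank sends one small message around a ring through the
     * offload queue and receives directly. */
    char msg[] = "hello";
    char buf[sizeof msg];
    if (nprocs > 1) {
        enqueue_send(msg, (int)sizeof msg, (rank + 1) % nprocs, 0);
        MPI_Recv(buf, (int)sizeof buf, MPI_CHAR, (rank - 1 + nprocs) % nprocs,
                 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d received: %s\n", rank, buf);
    }

    pthread_mutex_lock(&qlock);
    shutting_down = 1;
    pthread_mutex_unlock(&qlock);
    pthread_join(tid, NULL);

    MPI_Finalize();
    return 0;
}
```

In the approach described in the abstract, this tasking is handled inside the runtime rather than by the application, which is why no application changes are required and why contention under MPI_THREAD_MULTIPLE is reduced.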




Published In

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015
985 pages
ISBN:9781450337236
DOI:10.1145/2807591
  • General Chair: Jackie Kern
  • Program Chair: Jeffrey S. Vetter
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 November 2015

Qualifiers

  • Research-article

Conference

SC15

Acceptance Rates

SC '15 Paper Acceptance Rate: 79 of 358 submissions, 22%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%


