DOI: 10.1145/2616498.2616532

Benefits of Cross Memory Attach for MPI libraries on HPC Clusters

Published: 13 July 2014

Abstract

With the number of cores per node increasing in modern clusters, an efficient implementation of intra-node communication is critical for application performance. MPI libraries generally use shared-memory mechanisms for communication inside a node, but this approach has limitations for large messages. Linux kernel 3.2 introduced Cross Memory Attach (CMA), a mechanism that improves communication between MPI processes running on the same node. However, because CMA is not enabled by default in the MPI libraries that support it, HPC administrators may leave it disabled, and users then miss out on its performance benefits. In this paper, we explain how to use CMA and present an evaluation of CMA using micro-benchmarks and the NAS Parallel Benchmarks (NPB), a set of applications commonly used to evaluate parallel systems.
Our performance evaluation reveals that CMA outperforms shared memory for large messages. Micro-benchmark evaluations show that CMA can improve performance by as much as a factor of four. With NPB, we see up to 24.75% improvement in total execution time for FT and up to 24.08% for IS.
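
CMA is exposed to user space through the process_vm_readv() and process_vm_writev() system calls added in Linux 3.2, which let one process copy data directly to or from another process's address space with a single copy. The sketch below is our own illustration rather than code from the paper (the fork()-based setup, buffer size, and pipe handshake are assumptions made for the example); it shows the basic pattern a CMA-enabled large-message path builds on: the receiver learns the sender's PID and buffer address, then pulls the payload with one process_vm_readv() call instead of staging it through a shared-memory segment.

    /* Minimal sketch of a Cross Memory Attach (CMA) transfer using
     * process_vm_readv() (Linux >= 3.2, glibc >= 2.15).  A child process
     * fills a buffer and the parent copies it straight out of the child's
     * address space in one system call, i.e. a single copy instead of the
     * copy-in/copy-out through a shared-memory segment that MPI libraries
     * traditionally use for large intra-node messages.  Illustrative only. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>    /* process_vm_readv() */
    #include <sys/wait.h>
    #include <unistd.h>

    #define LEN (1 << 20)   /* a 1 MiB "large message" */

    int main(void)
    {
        /* Allocated before fork(): the buffer has the same virtual address in
         * parent and child, standing in for the address exchange an MPI
         * library would perform during its rendezvous protocol. */
        char *sendbuf = malloc(LEN);
        int ready[2], done[2];
        pipe(ready);
        pipe(done);

        pid_t child = fork();
        if (child == 0) {                     /* child plays the "sender"    */
            char tmp;
            memset(sendbuf, 'x', LEN);        /* produce the message payload */
            write(ready[1], "r", 1);          /* tell the parent it is ready */
            read(done[0], &tmp, 1);           /* stay alive until parent is done */
            _exit(0);
        }

        /* parent plays the "receiver" */
        char c;
        read(ready[0], &c, 1);                /* wait for the child's data   */

        char *recvbuf = malloc(LEN);
        struct iovec local  = { .iov_base = recvbuf, .iov_len = LEN };
        struct iovec remote = { .iov_base = sendbuf, .iov_len = LEN };

        /* One syscall moves the whole message out of the child's memory.
         * Requires ptrace-level permission on the target process (same UID
         * or CAP_SYS_PTRACE; see /proc/sys/kernel/yama/ptrace_scope). */
        ssize_t n = process_vm_readv(child, &local, 1, &remote, 1, 0);
        if (n < 0)
            perror("process_vm_readv");
        else
            printf("copied %zd bytes, first byte '%c'\n", n, recvbuf[0]);

        write(done[1], "d", 1);               /* let the child exit          */
        waitpid(child, NULL, 0);
        return 0;
    }

In MPI libraries that implement a CMA path, this mechanism typically has to be turned on through the library's runtime configuration rather than being active by default, which is the administrative pitfall the paper highlights.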

References

[1]
Intel MPI Benchmark. http://software.intel.com/en-us/articles/intel-mpi-benchmarks.
[2]
NAS Parallel Benchmarks. http://www.nas.nasa.gov/publications/npb.html.
[3]
A. R. Mamidala, A. Vishnu and D. K. Panda. Efficient Shared Memory and RDMA Based Design for MPI-Allgather over InfiniBand. In PVM/MPI User's Group Meeting, 2006.
[4]
D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, D. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. The NAS Parallel Benchmarks. The Intl. Journal of Supercomputer Applications, 5(3):63--73, Fall 1991.
[5]
D. Buntinas, B. Goglin, D. Goodell, G. Mercier, and S. Moreaud. Cache-Efficient, Intranode, Large-Message MPI Communication with MPICH2-Nemesis. In International Conference on Parallel Processing, 2009.
[6]
D. Buntinas, G. Mercier, and W. Gropp. Data Transfers between Processes in an SMP System: Performance Study and Application to MPI. In International Conference on Parallel Processing 2006, pages 487--496, Aug. 2006.
[7]
D. Buntinas, G. Mercier, and W. Gropp. Implementation and Evaluation of Shared-Memory Communication and Synchronization Operations in MPICH2 using the Nemesis Communication Subsystem. Parallel Computing, 33(9):634--644, Sept. 2007.
[8]
E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In PVM/MPI Users' Group Meeting, pages 97--104, 2004.
[9]
B. Goglin and S. Moreaud. KNEM: a Generic and Scalable Kernel-Assisted Intra-node MPI Communication Framework. Journal of Parallel and Distributed Computing, 73(2):176--188, 2013.
[10]
R. Graham and G. Shipman. MPI Support for Multi-core Architectures: Optimized Shared Memory Collectives. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science, volume Volume 5205/2008, pages 130--140, 2008.
[11]
T. Hoefler and M. Snir. Generic Topology Mapping Strategies for Large-scale Parallel Architectures. In International Conference on Supercomputing, pages 75--85, 2011.
[12]
W. Huang, G. Santhanaraman, H.-W. Jin, Q. Gao, and D. K. x. D. K. Panda. Design of High Performance MVAPICH2: MPI2 over InfiniBand. In CCGRID, pages 43--48, 2006.
[13]
H. Jin, S. Sur, L. Chai, and D. K. Panda. LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster. In International Conference on Parallel Processing, pages 184--191, 2005.
[14]
M. Koop, W. Huang, K. Gopalakrishnan, and D. K. Panda. Performance Analysis and Evaluation of PCIe 2.0 and Quad-Data Rate InfiniBand. In Hot Interconnects, 2008.
[15]
M. Luo, H. Wang, J. Vienne, and D. Panda. Redesigning MPI Shared Memory Communication for Large Multi-Core Architecture. Computer Science - Research and Development, 28(2-3), 2013.
[16]
T. Ma, G. Bosilca, A. Bouteiller, B. Goglin, J. M. Squyres, and J. J. Dongarra. Kernel Assisted Collective Intra-node MPI Communication among Multi-Core and Many-Core CPUs. In International Conference on Parallel Processing, pages 532--541, 2011.
[17]
G. Mercier and E. Jeannot. Improving MPI Applications Performance on Multicore Clusters with Rank Reordering. In EuroMPI, pages 39--49, 2011.
[18]
MPI Forum. MPI: A Message Passing Interface. In Proceedings of Supercomputing, 1993.
[19]
H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda. Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes. In Supercomputing, pages 70:1--70:12, 2012.
[20]
J. Vienne, J. Chen, M. Wasi-Ur-Rahman, N. S. Islam, H. Subramoni, and D. K. Panda. Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing Systems. In Hot Interconnects, pages 48--55, 2012.
[21]
F. C. Wong, R. P. Martin, R. H. Arpaci-Dusseau, and D. E. Culler. Architectural Requirements and Scalability of the NAS Parallel Benchmarks. In Supercomputing, 1999.



Published In

XSEDE '14: Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment
July 2014
445 pages
ISBN:9781450328937
DOI:10.1145/2616498
  • General Chair: Scott Lathrop
  • Program Chair: Jay Alameda
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • NSF: National Science Foundation
  • Drexel University
  • Indiana University

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. MPI
  2. Multicore processing
  3. Parallel programming
  4. Performance analysis

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

XSEDE '14

Acceptance Rates

XSEDE '14 Paper Acceptance Rate: 80 of 120 submissions, 67%
Overall Acceptance Rate: 129 of 190 submissions, 68%



