DOI: 10.1145/2616498.2616532

Benefits of Cross Memory Attach for MPI libraries on HPC Clusters

Published: 13 July 2014

Abstract

With the number of cores per node increasing in modern clusters, an efficient implementation of intra-node communication is critical for application performance. MPI libraries generally use shared-memory mechanisms for communication inside a node, but this approach has limitations for large messages. Linux kernel 3.2 introduced Cross Memory Attach (CMA), a mechanism that improves communication between MPI processes running on the same node. However, because CMA is not enabled by default in the MPI libraries that support it, HPC administrators may leave it disabled, and users then miss out on its performance benefits. In this paper, we explain how to use CMA and present an evaluation of CMA using micro-benchmarks and the NAS Parallel Benchmarks (NPB), a set of applications commonly used to evaluate parallel systems.
Our performance evaluation reveals that CMA outperforms shared memory for large messages. Micro-benchmark evaluations show that CMA can improve performance by as much as a factor of four. With NPB, we see up to 24.75% improvement in total execution time for FT and up to 24.08% for IS.
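
CMA is exposed to user space through the process_vm_readv() and process_vm_writev() system calls added in Linux 3.2, which let one process copy data directly to or from another process's address space with a single copy. The sketch below is our own illustration rather than code from the paper (the fork()-based setup, buffer size, and pipe handshake are assumptions made for the example); it shows the basic pattern a CMA-enabled large-message path builds on: the receiver learns the sender's PID and buffer address, then pulls the payload with one process_vm_readv() call instead of staging it through a shared-memory segment.

    /* Minimal sketch of a Cross Memory Attach (CMA) transfer using
     * process_vm_readv() (Linux >= 3.2, glibc >= 2.15).  A child process
     * fills a buffer and the parent copies it straight out of the child's
     * address space in one system call, i.e. a single copy instead of the
     * copy-in/copy-out through a shared-memory segment that MPI libraries
     * traditionally use for large intra-node messages.  Illustrative only. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>    /* process_vm_readv() */
    #include <sys/wait.h>
    #include <unistd.h>

    #define LEN (1 << 20)   /* a 1 MiB "large message" */

    int main(void)
    {
        /* Allocated before fork(): the buffer has the same virtual address in
         * parent and child, standing in for the address exchange an MPI
         * library would perform during its rendezvous protocol. */
        char *sendbuf = malloc(LEN);
        int ready[2], done[2];
        pipe(ready);
        pipe(done);

        pid_t child = fork();
        if (child == 0) {                     /* child plays the "sender"    */
            char tmp;
            memset(sendbuf, 'x', LEN);        /* produce the message payload */
            write(ready[1], "r", 1);          /* tell the parent it is ready */
            read(done[0], &tmp, 1);           /* stay alive until parent is done */
            _exit(0);
        }

        /* parent plays the "receiver" */
        char c;
        read(ready[0], &c, 1);                /* wait for the child's data   */

        char *recvbuf = malloc(LEN);
        struct iovec local  = { .iov_base = recvbuf, .iov_len = LEN };
        struct iovec remote = { .iov_base = sendbuf, .iov_len = LEN };

        /* One syscall moves the whole message out of the child's memory.
         * Requires ptrace-level permission on the target process (same UID
         * or CAP_SYS_PTRACE; see /proc/sys/kernel/yama/ptrace_scope). */
        ssize_t n = process_vm_readv(child, &local, 1, &remote, 1, 0);
        if (n < 0)
            perror("process_vm_readv");
        else
            printf("copied %zd bytes, first byte '%c'\n", n, recvbuf[0]);

        write(done[1], "d", 1);               /* let the child exit          */
        waitpid(child, NULL, 0);
        return 0;
    }

In MPI libraries that implement a CMA path, this mechanism typically has to be turned on through the library's runtime configuration rather than being active by default, which is the administrative pitfall the paper highlights.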

References

[1]
Intel MPI Benchmark. http://software.intel.com/en-us/articles/intel-mpi-benchmarks.
[2]
NAS Parallel Benchmarks. http://www.nas.nasa.gov/publications/npb.html.
[3]
A. R. Mamidala, A. Vishnu and D. K. Panda. Efficient Shared Memory and RDMA Based Design for MPI-Allgather over InfiniBand. In PVM/MPI User's Group Meeting, 2006.
[4]
D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, D. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. The NAS Parallel Benchmarks. The Intl. Journal of Supercomputer Applications, 5(3):63--73, Fall 1991.
[5]
D. Buntinas, B. Goglin, D. Goodell, G. Mercier, and S. Moreaud. Cache-Efficient, Intranode, Large-Message MPI Communication with MPICH2-Nemesis. In International Conference on Parallel Processing, 2009.
[6]
D. Buntinas, G. Mercier, and W. Gropp. Data Transfers between Processes in an SMP System: Performance Study and Application to MPI. In International Conference on Parallel Processing 2006, pages 487--496, Aug. 2006.
[7]
D. Buntinas, G. Mercier, and W. Gropp. Implementation and Evaluation of Shared-Memory Communication and Synchronization Operations in MPICH2 using the Nemesis Communication Subsystem. Parallel Computing, 33(9):634--644, Sept. 2007.
[8]
E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In PVM/MPI Users' Group Meeting, pages 97--104, 2004.
[9]
B. Goglin and S. Moreaud. KNEM: a Generic and Scalable Kernel-Assisted Intra-node MPI Communication Framework. Journal of Parallel and Distributed Computing, 73(2):176--188, 2013.
[10]
R. Graham and G. Shipman. MPI Support for Multi-core Architectures: Optimized Shared Memory Collectives. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science, volume Volume 5205/2008, pages 130--140, 2008.
[11]
T. Hoefler and M. Snir. Generic Topology Mapping Strategies for Large-scale Parallel Architectures. In International Conference on Supercomputing, pages 75--85, 2011.
[12]
W. Huang, G. Santhanaraman, H.-W. Jin, Q. Gao, and D. K. x. D. K. Panda. Design of High Performance MVAPICH2: MPI2 over InfiniBand. In CCGRID, pages 43--48, 2006.
[13]
H. Jin, S. Sur, L. Chai, and D. K. Panda. LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster. In International Conference on Parallel Processing, pages 184--191, 2005.
[14]
M. Koop, W. Huang, K. Gopalakrishnan, and D. K. Panda. Performance Analysis and Evaluation of PCIe 2.0 and Quad-Data Rate InfiniBand. In Hot Interconnects, 2008.
[15]
M. Luo, H. Wang, J. Vienne, and D. Panda. Redesigning MPI Shared Memory Communication for Large Multi-Core Architecture. Computer Science - Research and Development, 28(2-3), 2013.
[16]
T. Ma, G. Bosilca, A. Bouteiller, B. Goglin, J. M. Squyres, and J. J. Dongarra. Kernel Assisted Collective Intra-node MPI Communication among Multi-Core and Many-Core CPUs. In International Conference on Parallel Processing, pages 532--541, 2011.
[17]
G. Mercier and E. Jeannot. Improving MPI Applications Performance on Multicore Clusters with Rank Reordering. In EuroMPI, pages 39--49, 2011.
[18]
MPI Forum. MPI: A Message Passing Interface. In Proceedings of Supercomputing, 1993.
[19]
H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda. Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes. In Supercomputing, pages 70:1--70:12, 2012.
[20]
J. Vienne, J. Chen, M. Wasi-Ur-Rahman, N. S. Islam, H. Subramoni, and D. K. Panda. Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing Systems. In Hot Interconnects, pages 48--55, 2012.
[21]
F. C. Wong, R. P. Martin, R. H. Arpaci-Dusseau, and D. E. Culler. Architectural Requirements and Scalability of the NAS Parallel Benchmarks. In Supercomputing, 1999.



Published In

XSEDE '14: Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment
July 2014
445 pages
ISBN:9781450328937
DOI:10.1145/2616498
  • General Chair: Scott Lathrop
  • Program Chair: Jay Alameda
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • NSF: National Science Foundation
  • Drexel University
  • Indiana University

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. MPI
  2. Multicore processing
  3. Parallel programming
  4. Performance analysis

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

XSEDE '14

Acceptance Rates

XSEDE '14 Paper Acceptance Rate: 80 of 120 submissions, 67%
Overall Acceptance Rate: 129 of 190 submissions, 68%



