OpenSHMEM over MPI as a Performance Contender: Thorough Analysis and Optimizations

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13159)

Abstract

OpenSHMEM is a Partitioned Global Address Space (PGAS) style programming model for one-sided, scalable communication over distributed-memory systems. The community has always focused on delivering high performance for specific communication operations such as RMA, atomics, and collectives, and it encourages native implementations that port directly onto the underlying network hardware in order to minimize the instructions issued between the application and the network. OSHMPI is an OpenSHMEM implementation on top of MPI that aims to provide portable support for OpenSHMEM communication on mainstream HPC systems. Because of the generalized functionality of MPI, however, OSHMPI incurs heavy software overheads on the performance-critical path.

Why does OpenSHMEM over MPI not perform well? To answer this question, this paper provides an in-depth analysis of the software overheads on the OSHMPI performance-critical path, from the perspective of both the semantics and the library implementation. We also present various optimizations in the MPI and OSHMPI implementations while maintaining full MPI functionality. For the remaining performance overheads that fundamentally cannot be avoided under the MPI-3.1 standard, we recommend extensions to the MPI standard to provide efficient support for OpenSHMEM-like PGAS programming models. We evaluate the optimized OSHMPI by comparing it with the native implementation of OpenSHMEM on an Intel Broadwell cluster with the Omni-Path interconnect. The evaluation results demonstrate that the optimized OSHMPI/MPI environment can deliver performance similar to that of the native implementation.
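
As a concrete illustration of the communication pattern the paper analyzes, the following minimal sketch (ours, not taken from the paper) issues an OpenSHMEM one-sided put followed by shmem_quiet; in OSHMPI, these calls are ultimately mapped onto MPI-3 RMA operations on windows exposing the symmetric memory.

    /* Minimal OpenSHMEM put/quiet sketch (illustration only, not from the paper). */
    #include <shmem.h>
    #include <stdio.h>

    static long dst = 0;               /* global variables are symmetric: one copy per PE */

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        long src    = (long)me;
        int  target = (me + 1) % npes;

        shmem_long_put(&dst, &src, 1, target);  /* one-sided RMA put to the target PE */
        shmem_quiet();                          /* complete all outstanding put operations */
        shmem_barrier_all();                    /* ensure every PE's put has landed */

        printf("PE %d received %ld\n", me, dst);
        shmem_finalize();
        return 0;
    }

Each of these fine-grained calls sits on the performance-critical path whose MPI-based implementation the paper dissects.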


Notes

  1. https://github.com/pmodels/oshmpi/releases/tag/v2.0b1.

  2. Flush_local locally completes all outstanding RMA operations initiated by the calling process to the remote process specified by rank on the window (notes 2-4 and 8 are illustrated in the sketch after this list).

  3. Flush_all ensures that all outstanding RMA operations issued by the calling process to any remote process on the window have completed both at the local and at the remote side.

  4. Win_sync synchronizes memory updates on the specified window.

  5. http://www.mpich.org/downloads/.

  6. ofi_inject_write is used for data smaller than 64 bytes, and ofi_write is used for other data sizes. The latter only initiates a write to remote memory, whereas the former also guarantees local completion.

  7. fi_cntr_read reads an OFI event counter that is updated at operation completion, and fi_cntr_wait is its blocking version.

  8. MPI_PROC_NULL is a predefined dummy process rank in MPI. An MPI RMA operation using MPI_PROC_NULL as the remote rank is a no-op.

  9. https://www.lcrc.anl.gov/systems/resources/bebop.

  10. We made the following changes in SOS to ensure a fair comparison with OSHMPI/MPICH: (1) disable the OFI domain thread (set the domain attribute data_progress = FI_PROGRESS_MANUAL at shmem_init) to reduce latency overhead for large data transfers; (2) reduce frequent calls to the expensive fi_cntr_wait in shmem_quiet; and (3) disable the bounce-buffer optimization in the latency test because it increases latency overhead for medium data sizes (set the environment variable SHMEM_BOUNCE_SIZE=0).
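
To make notes 2-4 and 8 concrete, the sketch below (ours, not from the paper) exercises the corresponding MPI-3 RMA calls inside a passive-target lock_all epoch; the persistent epoch is an assumption modeled on how an OpenSHMEM-over-MPI runtime typically exposes symmetric memory.

    /* Sketch of the MPI-3 RMA synchronization calls summarized in notes 2-4 and 8. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long buf = 0;                       /* memory exposed through the window */
        MPI_Win win;
        MPI_Win_create(&buf, sizeof(long), sizeof(long),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_lock_all(MPI_MODE_NOCHECK, win);   /* passive-target epoch to all ranks */

        long val    = (long)rank;
        int  target = (rank + 1) % size;

        MPI_Put(&val, 1, MPI_LONG, target, 0, 1, MPI_LONG, win);
        MPI_Win_flush_local(target, win);   /* note 2: local completion only; val is reusable */

        MPI_Win_flush_all(win);             /* note 3: local and remote completion, all targets */
        MPI_Win_sync(win);                  /* note 4: sync public/private copies of the window */

        /* note 8: an RMA operation with MPI_PROC_NULL as the remote rank is a no-op */
        MPI_Put(&val, 1, MPI_LONG, MPI_PROC_NULL, 0, 1, MPI_LONG, win);

        MPI_Win_unlock_all(win);
        MPI_Barrier(MPI_COMM_WORLD);
        printf("rank %d has buf = %ld\n", rank, buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }

The flush calls shown here are the ones OSHMPI issues to implement shmem_quiet and related completion semantics, which is why their cost dominates the overheads analyzed in the paper.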


Acknowledgment

This research was supported by the United States Department of Defense (DoD). This material was based upon work supported by the United States Department of Energy, Office of Science, Advanced Scientific Computing Research (SC-21), under contract DE-AC02-06CH11357. The experimental resource for this paper was provided by the Laboratory Computing Resource Center on the Bebop cluster at Argonne National Laboratory.

Author information


Correspondence to Min Si.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Si, M., Fu, H., Hammond, J.R., Balaji, P. (2022). OpenSHMEM over MPI as a Performance Contender: Thorough Analysis and Optimizations. In: Poole, S., Hernandez, O., Baker, M., Curtis, T. (eds) OpenSHMEM and Related Technologies. OpenSHMEM in the Era of Exascale and Smart Networks. OpenSHMEM 2021. Lecture Notes in Computer Science, vol 13159. Springer, Cham. https://doi.org/10.1007/978-3-031-04888-3_3

  • DOI: https://doi.org/10.1007/978-3-031-04888-3_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-04887-6

  • Online ISBN: 978-3-031-04888-3

  • eBook Packages: Computer Science, Computer Science (R0)
