OpenSHMEM over MPI as a Performance Contender: Thorough Analysis and Optimizations

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13159)

Abstract

OpenSHMEM is a Partitioned Global Address Space (PGAS) style programming model for one-sided, scalable communication over distributed-memory systems. The community has always focused on delivering high performance for specific communication operations such as RMA, atomics, and collectives, and it encourages native implementations that port directly onto the underlying network hardware in order to minimize the instructions issued between the application and the network. OSHMPI is an OpenSHMEM implementation on top of MPI that aims to provide portable support for OpenSHMEM communication on mainstream HPC systems. Because of the generalized functionality of MPI, however, OSHMPI incurs heavy software overheads on the performance-critical path.

Why does OpenSHMEM over MPI not perform well? To answer this question, this paper provides an in-depth analysis of the software overheads on the OSHMPI performance-critical path, from the perspective of both the semantics and the library implementation. We also present various optimizations in the MPI and OSHMPI implementations while maintaining full MPI functionality. For the remaining performance overheads that fundamentally cannot be avoided under the MPI-3.1 standard, we recommend extensions to the MPI standard to provide efficient support for OpenSHMEM-like PGAS programming models. We evaluate the optimized OSHMPI by comparing it with the native implementation of OpenSHMEM on an Intel Broadwell cluster with the Omni-Path interconnect. The evaluation results demonstrate that the optimized OSHMPI/MPI environment can deliver performance similar to that of the native implementation.
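
As a concrete illustration of the communication pattern the paper analyzes, the following minimal sketch (ours, not taken from the paper) issues an OpenSHMEM one-sided put followed by shmem_quiet; in OSHMPI, these calls are ultimately mapped onto MPI-3 RMA operations on windows exposing the symmetric memory.

    /* Minimal OpenSHMEM put/quiet sketch (illustration only, not from the paper). */
    #include <shmem.h>
    #include <stdio.h>

    static long dst = 0;               /* global variables are symmetric: one copy per PE */

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        long src    = (long)me;
        int  target = (me + 1) % npes;

        shmem_long_put(&dst, &src, 1, target);  /* one-sided RMA put to the target PE */
        shmem_quiet();                          /* complete all outstanding put operations */
        shmem_barrier_all();                    /* ensure every PE's put has landed */

        printf("PE %d received %ld\n", me, dst);
        shmem_finalize();
        return 0;
    }

Each of these fine-grained calls sits on the performance-critical path whose MPI-based implementation the paper dissects.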


Notes

  1. https://github.com/pmodels/oshmpi/releases/tag/v2.0b1.

  2. Flush_local locally completes all outstanding RMA operations initiated by the calling process to the remote process specified by rank on the window (notes 2-4 and 8 are illustrated in the sketch after this list).

  3. Flush_all ensures that all outstanding RMA operations issued by the calling process to any remote process on the window have completed both at the local and at the remote side.

  4. Win_sync synchronizes memory updates on the specified window.

  5. http://www.mpich.org/downloads/.

  6. ofi_inject_write is used for data smaller than 64 bytes, and ofi_write is used for other data sizes. The latter only initiates a write to remote memory, whereas the former also guarantees local completion.

  7. fi_cntr_read reads an OFI event counter that is updated at operation completion, and fi_cntr_wait is its blocking version.

  8. MPI_PROC_NULL is a predefined dummy process rank in MPI. An MPI RMA operation using MPI_PROC_NULL as the remote rank is a no-op.

  9. https://www.lcrc.anl.gov/systems/resources/bebop.

  10. We made the following changes in SOS to ensure a fair comparison with OSHMPI/MPICH: (1) disable the OFI domain thread (set the domain attribute data_progress = FI_PROGRESS_MANUAL at shmem_init) to reduce latency overhead for large data transfers; (2) reduce frequent calls to the expensive fi_cntr_wait in shmem_quiet; and (3) disable the bounce-buffer optimization in the latency test because it increases latency overhead for medium data sizes (set the environment variable SHMEM_BOUNCE_SIZE=0).
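
To make notes 2-4 and 8 concrete, the sketch below (ours, not from the paper) exercises the corresponding MPI-3 RMA calls inside a passive-target lock_all epoch; the persistent epoch is an assumption modeled on how an OpenSHMEM-over-MPI runtime typically exposes symmetric memory.

    /* Sketch of the MPI-3 RMA synchronization calls summarized in notes 2-4 and 8. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long buf = 0;                       /* memory exposed through the window */
        MPI_Win win;
        MPI_Win_create(&buf, sizeof(long), sizeof(long),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_lock_all(MPI_MODE_NOCHECK, win);   /* passive-target epoch to all ranks */

        long val    = (long)rank;
        int  target = (rank + 1) % size;

        MPI_Put(&val, 1, MPI_LONG, target, 0, 1, MPI_LONG, win);
        MPI_Win_flush_local(target, win);   /* note 2: local completion only; val is reusable */

        MPI_Win_flush_all(win);             /* note 3: local and remote completion, all targets */
        MPI_Win_sync(win);                  /* note 4: sync public/private copies of the window */

        /* note 8: an RMA operation with MPI_PROC_NULL as the remote rank is a no-op */
        MPI_Put(&val, 1, MPI_LONG, MPI_PROC_NULL, 0, 1, MPI_LONG, win);

        MPI_Win_unlock_all(win);
        MPI_Barrier(MPI_COMM_WORLD);
        printf("rank %d has buf = %ld\n", rank, buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }

The flush calls shown here are the ones OSHMPI issues to implement shmem_quiet and related completion semantics, which is why their cost dominates the overheads analyzed in the paper.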


Acknowledgment

This research was supported by the United States Department of Defense (DoD). This material was based upon work supported by the United States Department of Energy, Office of Science, Advanced Scientific Computing Research (SC-21), under contract DE-AC02-06CH11357. The experimental resource for this paper was provided by the Laboratory Computing Resource Center on the Bebop cluster at Argonne National Laboratory.

Author information


Correspondence to Min Si.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Si, M., Fu, H., Hammond, J.R., Balaji, P. (2022). OpenSHMEM over MPI as a Performance Contender: Thorough Analysis and Optimizations. In: Poole, S., Hernandez, O., Baker, M., Curtis, T. (eds) OpenSHMEM and Related Technologies. OpenSHMEM in the Era of Exascale and Smart Networks. OpenSHMEM 2021. Lecture Notes in Computer Science, vol 13159. Springer, Cham. https://doi.org/10.1007/978-3-031-04888-3_3

  • DOI: https://doi.org/10.1007/978-3-031-04888-3_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-04887-6

  • Online ISBN: 978-3-031-04888-3

  • eBook Packages: Computer Science, Computer Science (R0)
