A Case for Non-blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 9397)

Abstract

An ever-increasing push for performance in the HPC arena has led to a multitude of hybrid architectures, in both software and hardware, for HPC systems. The Partitioned Global Address Space (PGAS) programming model has gained considerable attention over the last couple of years. The main advantage of the PGAS model is the ease of programming provided by the abstraction of a single memory space spanning the nodes of a cluster. Current OpenSHMEM implementations follow the OpenSHMEM 1.2 specification, which provides interfaces for one-sided, atomic, and collective operations. However, the recent trend in the HPC arena in general, and in the Message Passing Interface (MPI) community in particular, is to use Non-Blocking Collective (NBC) communication to efficiently overlap computation with communication and save precious CPU cycles.
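For concreteness, the minimal C sketch below illustrates the MPI-3 non-blocking collective pattern referred to above: a collective is initiated, independent computation proceeds while the operation makes progress, and a wait call completes it. This is only an illustration of the general NBC idea, not code from the paper.

    #include <mpi.h>

    /* MPI-3 non-blocking collective pattern: start, overlap, wait. */
    void nbc_overlap(double *buf, int count, MPI_Comm comm)
    {
        MPI_Request req;

        /* Start a non-blocking broadcast rooted at rank 0. */
        MPI_Ibcast(buf, count, MPI_DOUBLE, 0, comm, &req);

        /* ... independent computation overlapped with the collective ... */

        /* Complete the collective; buf is valid only after this returns. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }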

This work is inspired by the encouraging performance numbers reported for the NBC implementations of various MPI libraries. As the OpenSHMEM community has been discussing the use of non-blocking communication, in this paper we propose an NBC interface for OpenSHMEM and present its design, implementation, and performance evaluation. The proposed interface is modeled along the lines of the MPI NBC interface and requires minimal changes to the existing function signatures. We have designed and implemented this interface using the Unified Communication Runtime in MVAPICH2-X. In addition, we propose OpenSHMEM NBC benchmarks as an extension to the OpenSHMEM benchmarks available in the widely used OSU Micro-Benchmarks (OMB) suite. Our performance evaluation shows that the proposed NBC implementation provides up to 96 percent overlap for different collectives, with little NBC overhead.
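As an illustration of what "minimal changes to the function signatures" could mean, the sketch below follows the MPI model and appends a request handle to the existing blocking shmem_broadcast64 signature. The names shmem_broadcast64_nb, shmem_request_t, and shmem_wait_req are hypothetical placeholders invented for this sketch; they are not part of the OpenSHMEM 1.2 specification and may differ from the interface actually proposed in the paper. The stand-in bodies simply fall back to the blocking call so the sketch compiles.

    #include <shmem.h>

    /* HYPOTHETICAL interface sketch: these names are illustrative only and are
     * not taken from OpenSHMEM 1.2 or necessarily from the paper's proposal. */
    typedef long shmem_request_t;

    /* Non-blocking variant: same arguments as shmem_broadcast64 plus a request.
     * This stand-in performs the blocking broadcast; a real implementation
     * would return immediately and progress the collective asynchronously. */
    static void shmem_broadcast64_nb(void *target, const void *source, size_t nelems,
                                     int PE_root, int PE_start, int logPE_stride,
                                     int PE_size, long *pSync, shmem_request_t *req)
    {
        shmem_broadcast64(target, source, nelems, PE_root, PE_start,
                          logPE_stride, PE_size, pSync);
        *req = 0;
    }

    /* Completes the collective associated with req (no-op for the stand-in). */
    static void shmem_wait_req(shmem_request_t *req) { (void)req; }

    void bcast_with_overlap(long *dst, const long *src, size_t nelems, long *pSync)
    {
        shmem_request_t req;
        shmem_broadcast64_nb(dst, src, nelems, 0, 0, 0, shmem_n_pes(), pSync, &req);
        /* ... independent computation overlapped with the broadcast ... */
        shmem_wait_req(&req);   /* dst is valid only after completion */
    }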

This research is supported in part by National Science Foundation grants #OCI-1148371, #CCF-1213084, and #CNS-1419123.

Author information

Corresponding author

Correspondence to A. A. Awan.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Awan, A.A., Hamidouche, K., Chu, C.H., Panda, D.K. (2015). A Case for Non-blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X. In: Gorentla Venkata, M., Shamis, P., Imam, N., Lopez, M. (eds) OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies. OpenSHMEM 2015. Lecture Notes in Computer Science, vol 9397. Springer, Cham. https://doi.org/10.1007/978-3-319-26428-8_5

  • DOI: https://doi.org/10.1007/978-3-319-26428-8_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26427-1

  • Online ISBN: 978-3-319-26428-8

  • eBook Packages: Computer Science, Computer Science (R0)
