A Case for Non-blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 9397)

Abstract

An ever-increasing push for performance in the HPC arena has led to a multitude of hybrid architectures, in both software and hardware, for HPC systems. The Partitioned Global Address Space (PGAS) programming model has gained considerable attention over the last couple of years. The main advantage of the PGAS model is the ease of programming provided by the abstraction of a single memory space spanning the nodes of a cluster. Current OpenSHMEM implementations follow the OpenSHMEM 1.2 specification, which provides interfaces for one-sided, atomic, and collective operations. However, the recent trend in the HPC arena in general, and in the Message Passing Interface (MPI) community in particular, is to use Non-Blocking Collective (NBC) communication to efficiently overlap computation with communication and save precious CPU cycles.
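For concreteness, the minimal C sketch below illustrates the MPI-3 non-blocking collective pattern referred to above: a collective is initiated, independent computation proceeds while the operation makes progress, and a wait call completes it. This is only an illustration of the general NBC idea, not code from the paper.

    #include <mpi.h>

    /* MPI-3 non-blocking collective pattern: start, overlap, wait. */
    void nbc_overlap(double *buf, int count, MPI_Comm comm)
    {
        MPI_Request req;

        /* Start a non-blocking broadcast rooted at rank 0. */
        MPI_Ibcast(buf, count, MPI_DOUBLE, 0, comm, &req);

        /* ... independent computation overlapped with the collective ... */

        /* Complete the collective; buf is valid only after this returns. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }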

This work is inspired by the encouraging performance numbers reported for the NBC implementations of various MPI libraries. As the OpenSHMEM community has been discussing the use of non-blocking communication, in this paper we propose an NBC interface for OpenSHMEM and present its design, implementation, and performance evaluation. The proposed interface is modeled along the lines of the MPI NBC interface and requires minimal changes to the existing function signatures. We have designed and implemented this interface using the Unified Communication Runtime in MVAPICH2-X. In addition, we propose OpenSHMEM NBC benchmarks as an extension to the OpenSHMEM benchmarks available in the widely used OSU Micro-Benchmarks (OMB) suite. Our performance evaluation shows that the proposed NBC implementation provides up to 96 percent overlap for different collectives, with little NBC overhead.
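As an illustration of what "minimal changes to the function signatures" could mean, the sketch below follows the MPI model and appends a request handle to the existing blocking shmem_broadcast64 signature. The names shmem_broadcast64_nb, shmem_request_t, and shmem_wait_req are hypothetical placeholders invented for this sketch; they are not part of the OpenSHMEM 1.2 specification and may differ from the interface actually proposed in the paper. The stand-in bodies simply fall back to the blocking call so the sketch compiles.

    #include <shmem.h>

    /* HYPOTHETICAL interface sketch: these names are illustrative only and are
     * not taken from OpenSHMEM 1.2 or necessarily from the paper's proposal. */
    typedef long shmem_request_t;

    /* Non-blocking variant: same arguments as shmem_broadcast64 plus a request.
     * This stand-in performs the blocking broadcast; a real implementation
     * would return immediately and progress the collective asynchronously. */
    static void shmem_broadcast64_nb(void *target, const void *source, size_t nelems,
                                     int PE_root, int PE_start, int logPE_stride,
                                     int PE_size, long *pSync, shmem_request_t *req)
    {
        shmem_broadcast64(target, source, nelems, PE_root, PE_start,
                          logPE_stride, PE_size, pSync);
        *req = 0;
    }

    /* Completes the collective associated with req (no-op for the stand-in). */
    static void shmem_wait_req(shmem_request_t *req) { (void)req; }

    void bcast_with_overlap(long *dst, const long *src, size_t nelems, long *pSync)
    {
        shmem_request_t req;
        shmem_broadcast64_nb(dst, src, nelems, 0, 0, 0, shmem_n_pes(), pSync, &req);
        /* ... independent computation overlapped with the broadcast ... */
        shmem_wait_req(&req);   /* dst is valid only after completion */
    }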

This research is supported in part by National Science Foundation grants #OCI-1148371, #CCF-1213084, and #CNS-1419123.

Author information

Corresponding author

Correspondence to A. A. Awan.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Awan, A.A., Hamidouche, K., Chu, C.H., Panda, D.K. (2015). A Case for Non-blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X. In: Gorentla Venkata, M., Shamis, P., Imam, N., Lopez, M. (eds) OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies. OpenSHMEM 2015. Lecture Notes in Computer Science, vol 9397. Springer, Cham. https://doi.org/10.1007/978-3-319-26428-8_5

  • DOI: https://doi.org/10.1007/978-3-319-26428-8_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26427-1

  • Online ISBN: 978-3-319-26428-8

  • eBook Packages: Computer Science, Computer Science (R0)
