
Network Offloaded Hierarchical Collectives Using ConnectX-2’s CORE-Direct Capabilities

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 6305)

Abstract

As the scale of High Performance Computing (HPC) systems continues to increase, demanding that we extract even more parallelism from applications, the need to move communication management away from the Central Processing Unit (CPU) becomes even greater. Moving this management to the network frees up CPU cycles for computation, making it possible to overlap computation and communication. In this paper we continue to investigate how best to use the new CORE-Direct support in the ConnectX-2 Host Channel Adapter (HCA) to create high-performance, asynchronous collective operations that are managed by the HCA. Specifically, we take the network topology into account and build a two-level communication hierarchy, reducing the MPI_Barrier completion time by 45%, from 26.59 microseconds when the topology is ignored to 14.72 microseconds; the CPU-based collective barrier completes in 19.04 microseconds. The nonblocking barrier algorithm has similar performance, with about 50% of that time available for computation.
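The two-level barrier evaluated here follows the usual fan-in / inter-node exchange / fan-out pattern: the ranks on a node synchronize locally, one leader per node takes part in the inter-node step, and the leader then releases its local ranks. In the CORE-Direct implementation the inter-node phase is posted to the HCA's management queues and completes without CPU involvement; the sketch below is only a host-side illustration of the same structure in standard MPI. The use of MPI-3's MPI_Comm_split_type and the rank-0 leader rule are assumptions made for brevity, not details taken from the paper.

    /* Two-level (intra-node + inter-node) barrier sketch.  The actual
       CORE-Direct algorithm offloads the inter-node phase to the
       ConnectX-2 HCA; here every phase runs on the host for clarity. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Level 1: group the ranks that share a node. */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                            MPI_INFO_NULL, &node_comm);

        int node_rank;
        MPI_Comm_rank(node_comm, &node_rank);

        /* Level 2: one leader per node (node_rank == 0, an illustrative
           choice) joins the inter-node communicator; all other ranks
           receive MPI_COMM_NULL. */
        MPI_Comm leader_comm;
        MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                       world_rank, &leader_comm);

        MPI_Barrier(node_comm);             /* fan-in: wait for local ranks  */
        if (leader_comm != MPI_COMM_NULL)
            MPI_Barrier(leader_comm);       /* inter-node: leaders only      */
        MPI_Barrier(node_comm);             /* fan-out: release local ranks  */

        if (world_rank == 0)
            printf("two-level barrier complete\n");

        if (leader_comm != MPI_COMM_NULL)
            MPI_Comm_free(&leader_comm);
        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }

In the nonblocking variant, the blocking calls would be replaced by MPI_Ibarrier and a later MPI_Wait (or, with CORE-Direct, by a task list handed to the HCA), which is what allows roughly half of the barrier completion time reported in the abstract to be spent on computation.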

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Rabinovitz, I., Shamis, P., Graham, R.L., Bloch, N., Shainer, G. (2010). Network Offloaded Hierarchical Collectives Using ConnectX-2’s CORE-Direct Capabilities. In: Keller, R., Gabriel, E., Resch, M., Dongarra, J. (eds) Recent Advances in the Message Passing Interface. EuroMPI 2010. Lecture Notes in Computer Science, vol 6305. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15646-5_11

  • DOI: https://doi.org/10.1007/978-3-642-15646-5_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15645-8

  • Online ISBN: 978-3-642-15646-5
