Exploiting application buffer reuse to improve MPI small message transfer protocols over RDMA-enabled networks

Abstract

To avoid the memory registration cost for small messages, MPI implementations over RDMA-enabled networks use transfer protocols that copy each message into pre-registered intermediate buffers at both the sender and the receiver. In this paper, we propose eliminating the send-side copy when an application buffer is reused frequently: we show that it is more efficient to register the application buffer itself and use it directly for data transfer. The idea is examined for the small message transfer protocols in MVAPICH2, including RDMA Write and Send/Receive based communication, one-sided communication, and collectives. The proposed protocol adaptively falls back to the current protocol when the application does not reuse its buffers frequently. Performance results over InfiniBand indicate up to 14% improvement in single-message latency, close to 20% improvement for one-sided operations, and up to 25% improvement for collectives. In addition, the communication time of MPI applications with high buffer reuse improves with this technique.
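
As a rough illustration of the send-side decision described above, the sketch below models in plain C how an MPI library might count reuses of an application buffer and switch from the copy-based eager path to a registered, copy-free path once a threshold is crossed. It is a minimal sketch, not MVAPICH2 code: the cache layout, the REUSE_THRESHOLD value, and the register_buffer/send_from helpers are illustrative placeholders (actual registration on InfiniBand would go through the verbs API, e.g. ibv_reg_mr, and a real implementation would also deregister or evict stale entries).

/*
 * Sketch of a reuse-driven send-side eager path (illustrative only).
 * Small messages normally go through a copy into a pre-registered
 * intermediate buffer; once an application buffer has been reused
 * often enough, it is registered once and sent from directly.
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define REUSE_THRESHOLD 4      /* assumed switch-over point                */
#define CACHE_SLOTS     64     /* tiny direct-mapped reuse cache           */
#define EAGER_LIMIT     8192   /* "small message" cutoff handled here      */

typedef struct {
    void  *addr;               /* application buffer address               */
    size_t len;
    int    reuse_count;        /* sends observed from this buffer          */
    int    registered;         /* set once the buffer has been pinned      */
} reuse_entry;

static reuse_entry cache[CACHE_SLOTS];
static char eager_buf[EAGER_LIMIT];   /* stands in for the pre-registered
                                         intermediate (bounce) buffer      */

static void register_buffer(void *addr, size_t len)
{
    (void)addr;  /* a real implementation would call e.g. ibv_reg_mr() here */
    printf("registering %zu bytes once; later sends skip the copy\n", len);
}

static void send_from(const void *src, size_t len)
{
    (void)src;   /* a real implementation would post an RDMA write from src */
    printf("posting RDMA write of %zu bytes\n", len);
}

/* Decide, per send, between the copy-based path and the zero-copy path.
 * Assumes len <= EAGER_LIMIT (larger messages use a different protocol). */
static void eager_send(const void *app_buf, size_t len)
{
    /* Direct-mapped lookup; collisions simply reset the entry. */
    reuse_entry *e = &cache[((uintptr_t)app_buf >> 6) % CACHE_SLOTS];

    if (e->addr == app_buf && e->len == len) {
        e->reuse_count++;
    } else {                       /* new or evicted buffer: start counting */
        e->addr = (void *)app_buf;
        e->len = len;
        e->reuse_count = 1;
        e->registered = 0;
    }

    if (e->reuse_count >= REUSE_THRESHOLD) {
        if (!e->registered) {
            register_buffer(e->addr, e->len);  /* one-time pinning cost     */
            e->registered = 1;
        }
        send_from(app_buf, len);               /* copy-free path            */
    } else {
        memcpy(eager_buf, app_buf, len);       /* fall back: classic eager copy */
        send_from(eager_buf, len);
    }
}

int main(void)
{
    char msg[256] = "hello";
    for (int i = 0; i < 6; i++)    /* repeated sends from the same buffer   */
        eager_send(msg, sizeof msg);
    return 0;
}

The point of the threshold is that the one-time registration cost is only paid when it can be amortized over repeated sends from the same buffer; applications with little buffer reuse keep the existing copy-based behavior, which is the adaptive fallback the abstract refers to.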



Author information

Corresponding author

Correspondence to Ahmad Afsahi.


Cite this article

Rashti, M.J., Afsahi, A. Exploiting application buffer reuse to improve MPI small message transfer protocols over RDMA-enabled networks. Cluster Comput 14, 345–356 (2011). https://doi.org/10.1007/s10586-011-0165-8

