ABSTRACT
Modern architectures and communication systems software include complex hardware, communication abstractions, and optimizations that make their performance difficult to measure, model, and understand. This paper examines the ability of modified versions of the existing Netgauge communication performance measurement tool and the LogGOPS performance model to accurately characterize the communication behavior of modern hardware, MPI abstractions, and MPI implementations. In particular, it analyzes their ability to model GPU-aware communication in different MPI implementations and to quantify the performance characteristics of different approaches to non-contiguous data communication on modern GPU systems. Applying these techniques to a variety of implementations, optimization approaches, and systems demonstrates that modern communication system designs can produce widely varying and difficult-to-predict performance, even within the same hardware/communication software combination.
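For readers unfamiliar with the model family discussed above, LogGP predicts point-to-point message cost from a network latency L, a per-message CPU overhead o, and a per-byte gap G (LogGOPS further adds a per-byte overhead O and a synchronization threshold S). The following is a minimal sketch of the standard LogGP cost function with illustrative, hypothetical parameter values, not measurements from the paper:

```python
def loggp_time(k, L, o, G):
    """Predicted one-way time for a k-byte message under LogGP:
    sender overhead o, per-byte gap G for each byte after the first,
    network latency L, then receiver overhead o."""
    return o + (k - 1) * G + L + o

# Illustrative (hypothetical) parameters in microseconds:
# L = 1.0 us latency, o = 0.5 us overhead, G = 0.001 us/byte (~1 GB/s).
print(loggp_time(1024, L=1.0, o=0.5, G=0.001))  # ~3.02 us
```

Because the model is a closed-form expression in a handful of parameters, fitted values can be compared directly across MPI implementations and datatype-handling strategies, which is what makes it attractive for the kind of characterization this paper performs.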
REFERENCES
- Albert Alexandrov, Mihai F. Ionescu, Klaus E. Schauser, and Chris Scheiman. 1995. LogGP: Incorporating Long Messages into the LogP Model—One Step Closer towards a Realistic Model for Parallel Computation. In Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '95). Association for Computing Machinery, New York, NY, USA, 95–105.
- Nicholas Bacon. 2023. GPU Datatype Enhanced Netgauge. https://github.com/CUP-ECS/datatypes-logGP
- Amanda Bienz, Luke N. Olson, William D. Gropp, and Shelby Lockhart. 2021. Modeling Data Movement Performance on Heterogeneous Architectures. In 2021 IEEE High Performance Extreme Computing Conference (HPEC). 1–7.
- Dan Bonachea and Paul H. Hargrove. 2019. GASNet-EX: A High-Performance, Portable Communication Library for Exascale. In Languages and Compilers for Parallel Computing: 31st International Workshop (LCPC 2018), Salt Lake City, UT, USA, October 9–11, 2018, Revised Selected Papers. Springer, 138–158.
- Michael Boyer, Jiayuan Meng, and Kalyan Kumaran. 2013. Improving GPU Performance Prediction with Data Transfer Modeling. In 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum. IEEE, 1097–1106.
- David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. 1993. LogP: Towards a Realistic Model of Parallel Computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 1–12.
- Keira Haskins, Patrick Bridges, Kurt Ferreira, and Scott Levy. 2021. A Benchmark to Understand Communication Performance in Hybrid MPI and GPU Applications. Technical Report. Sandia National Laboratories, Albuquerque, NM.
- Torsten Hoefler, Torsten Mehlan, Andrew Lumsdaine, and Wolfgang Rehm. 2007. Netgauge: A Network Performance Measurement Framework. In Proceedings of High Performance Computing and Communications (HPCC '07), Houston, USA, Vol. 4782. Springer, 659–671.
- Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine. 2010. LogGOPSim: Simulating Large-Scale Applications in the LogGOPS Model. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. 597–604.
- Fumihiko Ino, Noriyuki Fujimoto, and Kenichi Hagihara. 2001. LogGPS: A Parallel Computational Model for Synchronization Analysis. In Proceedings of the Eighth ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming. 133–142.
- Argonne National Laboratory. 2020. Yaksa: High-Performance Noncontiguous Data Management. https://www.yaksa.org/
- Lawrence Berkeley National Laboratory. 2023. GASNet-EX API Description. https://gasnet.lbl.gov/docs/GASNet-EX.txt
- Message Passing Interface Forum. 2021. MPI: A Message-Passing Interface Standard, Version 4.0. https://www.mpi-forum.org/docs/
- Csaba Andras Moritz. 1998. Cost Modeling and Analysis: Towards Optimal Resource Utilization in Parallel Computer Systems. Ph.D. thesis, Royal Institute of Technology.
- NVIDIA. 2022. Faster Memory Transfers between CPU and GPU with GDRCopy. https://developer.nvidia.com/gdrcopy
- OpenUCX. 2023. Data Type Routines. https://openucx.readthedocs.io/en/master/api.html#data-type-routines
- Dhabaleswar K. Panda, Karen Tomko, Karl Schulz, and Amitava Majumdar. 2013. The MVAPICH Project: Evolution and Sustainability of an Open Source Production Quality MPI Library for HPC. In Workshop on Sustainable Software for Science: Practice and Experiences (WSSPE), held in conjunction with the Int'l Conference on Supercomputing.
- Carl Pearson, Kun Wu, I-Hsin Chung, Jinjun Xiong, and Wen-Mei Hwu. 2021. TEMPI: An Interposed MPI Library with a Canonical Representation of CUDA-Aware Datatypes. In Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing. 95–106.
- Rong Shi, Xiaoyi Lu, Sreeram Potluri, Khaled Hamidouche, Jie Zhang, and Dhabaleswar K. Panda. 2014. HAND: A Hybrid Approach to Accelerate Non-Contiguous Data Movement Using MPI Datatypes on GPU Clusters. In 2014 43rd International Conference on Parallel Processing. IEEE, 221–230.
- Xian-He Sun. 2003. Improving the Performance of MPI Derived Datatypes by Optimizing Memory-Access Cost. In 2003 Proceedings IEEE International Conference on Cluster Computing. IEEE, 412–419.
- Kaushik Kandadi Suresh, Kawthar Shafie Khorassani, Chen Chun Chen, Bharath Ramesh, Mustafa Abduljabbar, Aamir Shafi, Hari Subramoni, and Dhabaleswar K. Panda. 2022. Network Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries. In 2022 IEEE Symposium on High-Performance Interconnects (HOTI). IEEE, 13–20.
- Ben van Werkhoven, Jason Maassen, Frank J. Seinstra, and Henri E. Bal. 2014. Performance Models for CPU-GPU Data Transfers. In 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. IEEE, 11–20.
- Hao Wang, Sreeram Potluri, Miao Luo, Ashish Kumar Singh, Xiangyong Ouyang, Sayantan Sur, and Dhabaleswar K. Panda. 2011. Optimized Non-Contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2. In 2011 IEEE International Conference on Cluster Computing. IEEE, 308–316.
Index Terms
- Evaluating the Viability of LogGP for Modeling MPI Performance with Non-contiguous Datatypes on Modern Architectures