ABSTRACT
Modern high-performance computing systems feature a myriad of compute units connected via a complex network of nonuniform links. While this extensive connectivity can improve communication latency and bandwidth between components, it requires careful orchestration. Heterogeneous programming models, however, generally expose a flat view of the hardware in which all components are connected in a star topology through a uniform, nondescript link to the central processing unit. This discrepancy between the actual architecture and the simplified abstraction most programmers employ results in suboptimal utilization of complex system interconnects.
In this paper, we present an automated framework that exploits complex hardware links while preserving the simplified abstraction for the user. By decomposing user-issued memory operations into architecture-aware subtasks, we automatically exploit generally underused connections of the system. These links could be targeted by the user directly, but doing so portably is complex, machine-specific, and cumbersome. The operations we support include the movement, distribution, and consolidation of memory across the node. For each of them, our AutoStrategizer framework proposes a task graph that transparently improves performance, in terms of latency or bandwidth, compared with naive strategies. For our evaluation, we integrated the AutoStrategizer as a C++ library into the LLVM/OpenMP runtime infrastructure. We demonstrate that some memory operations can be improved by a factor of 6× compared with naive versions. Integrated into LLVM/OpenMP, the AutoStrategizer accelerates cross-device memory movement by ≈2× for large transfers, resulting in a 4× decrease in end-to-end execution time for a scientific proxy application.
Index Terms
- Memory Transfer Decomposition: Exploring Smart Data Movement Through Architecture-Aware Strategies