DOI: 10.1145/3624062.3624609
Research article
Results Replicated / v1.1

Memory Transfer Decomposition: Exploring Smart Data Movement Through Architecture-Aware Strategies

Published: 12 November 2023

ABSTRACT

Modern high-performance computing systems feature myriad compute units connected via a complex network of nonuniform links. While extensive connectivity can improve communication latency and bandwidth between components, it requires careful orchestration. However, heterogeneous programming models generally expose a flat view of the hardware where all components are connected in a star topology through a uniform, non-descriptive link to the central processing unit. This discrepancy between actual architecture and the simplified abstraction most often employed by programmers results in suboptimal utilization of the complex system interconnects.

In this paper, we provide an automated framework that utilizes complex hardware links while preserving the simplified abstraction for the user. By decomposing user-issued memory operations into architecture-aware subtasks, we automatically exploit connections of the system that are otherwise underused. Users can target these links explicitly, but doing so portably is complex, machine-specific, and cumbersome. The operations we support include movement, distribution, and consolidation of memory across the node. For each of them, our AutoStrategizer framework proposes a task graph that transparently improves performance, in terms of latency or bandwidth, compared with naive strategies. For our evaluation, we integrated the AutoStrategizer as a C++ library into the LLVM/OpenMP runtime infrastructure. We demonstrate that some memory operations can be improved by a factor of 6× compared with naive versions. Integrated into LLVM/OpenMP, our AutoStrategizer accelerates cross-device memory movement by a factor of ≈2× for large transfers, resulting in a 4× decrease in end-to-end execution time for a scientific proxy application.
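To make the decomposition idea concrete, here is a minimal sketch that contrasts a monolithic device-to-device copy with a chunked variant expressed through the standard OpenMP runtime call omp_target_memcpy. The helper names (copy_naive, copy_chunked), the fixed chunk count, and the use of a parallel loop to issue subtasks are illustrative assumptions standing in for the architecture-aware task graphs the AutoStrategizer generates; this is not the paper's implementation.

```cpp
// Illustrative sketch only: a hypothetical decomposition of one cross-device
// copy into chunked subtasks using standard OpenMP runtime calls. The chunking
// policy and num_chunks are assumptions for illustration, not the paper's strategy.
#include <omp.h>
#include <cstddef>

// Naive user-level view: one monolithic device-to-device copy.
void copy_naive(void *dst, const void *src, std::size_t bytes,
                int dst_dev, int src_dev) {
  omp_target_memcpy(dst, src, bytes, /*dst_offset=*/0, /*src_offset=*/0,
                    dst_dev, src_dev);
}

// Decomposed view: split the transfer into independent chunk subtasks so the
// runtime can overlap them (conceptually, the leaves of a transfer task graph).
void copy_chunked(void *dst, const void *src, std::size_t bytes,
                  int dst_dev, int src_dev, int num_chunks = 8) {
  const std::size_t chunk = (bytes + num_chunks - 1) / num_chunks;
  #pragma omp parallel for
  for (int i = 0; i < num_chunks; ++i) {
    const std::size_t off = static_cast<std::size_t>(i) * chunk;
    if (off >= bytes) continue;
    const std::size_t len = (off + chunk > bytes) ? bytes - off : chunk;
    omp_target_memcpy(dst, src, len, /*dst_offset=*/off, /*src_offset=*/off,
                      dst_dev, src_dev);
  }
}
```

In the actual framework, the split points and intermediate routes would be derived from the machine's interconnect topology rather than a fixed chunk count, which is what makes the generated task graphs architecture-aware.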


      • Published in

        SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
        November 2023
        2180 pages
        ISBN: 9798400707858
        DOI: 10.1145/3624062

        Copyright © 2023 ACM

        Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 12 November 2023


        Qualifiers

        • research-article
        • Research
        • Refereed limited