ABSTRACT
Modern high-performance computing systems feature a myriad of compute units connected via a complex network of nonuniform links. While this extensive connectivity can improve communication latency and bandwidth between components, it requires careful orchestration. Heterogeneous programming models, however, generally expose a flat view of the hardware in which all components are connected in a star topology through a uniform, nondescript link to the central processing unit. This discrepancy between the actual architecture and the simplified abstraction most programmers employ results in suboptimal utilization of complex system interconnects.
In this paper, we present an automated framework that exploits complex hardware links while preserving the simplified abstraction for the user. By decomposing user-issued memory operations into architecture-aware subtasks, we automatically exploit generally underused connections of the system. These links could be targeted by the user directly, but doing so portably is complex, machine-specific, and cumbersome. The operations we support include the movement, distribution, and consolidation of memory across the node. For each of them, our AutoStrategizer framework proposes a task graph that transparently improves performance, in terms of latency or bandwidth, compared with naive strategies. For our evaluation, we integrated the AutoStrategizer as a C++ library into the LLVM/OpenMP runtime infrastructure. We demonstrate that some memory operations can be improved by a factor of 6× compared with naive versions. Integrated into LLVM/OpenMP, the AutoStrategizer accelerates cross-device memory movement by ≈2× for large transfers, resulting in a 4× decrease in end-to-end execution time for a scientific proxy application.
Index Terms
- Memory Transfer Decomposition: Exploring Smart Data Movement Through Architecture-Aware Strategies