Abstract
Many applications require moving subsets of multidimensional arrays across the memory hierarchy of a computing system (MPI ranks, DRAM, GPU, etc.). While hardware supports efficient offload of contiguous data movement, non-contiguous data requires significantly more CPU orchestration. We run a series of multidimensional array copy and transpose microbenchmarks on two platforms, NERSC Perlmutter and ORNL Summit, and find that in some scenarios effective bandwidth degrades by up to 8-fold. We emulate a multidimensional array direct memory access (DMA) copy and transpose engine using a GPU kernel. This emulated DMA engine can more effectively prefetch and write-combine non-contiguous multidimensional array data, reducing latency and improving bandwidth. We propose a reconfigurable DMA engine that supports multiple strides and discuss how it can offload multidimensional array copy and transpose. Further, this DMA engine can use the stride information to better inform the policies of higher-level memory hierarchies to maximize bandwidth.
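To make the idea concrete, the following is a minimal software sketch (not the paper's actual implementation) of the kind of request a multi-stride DMA engine would accept: a hypothetical descriptor carrying per-dimension element counts and byte strides for source and destination, plus a host-side emulation of the copy it would offload. Swapping the destination strides turns the same copy into a transpose. All names (`dma_desc`, `dma_copy`) are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for one strided DMA request: up to four
 * dimensions, each with an element count and source/destination
 * byte strides. */
typedef struct {
    const char *src;
    char       *dst;
    size_t      elem_size;      /* bytes per element */
    int         ndims;
    size_t      count[4];       /* elements per dimension */
    size_t      src_stride[4];  /* byte stride per dimension (source) */
    size_t      dst_stride[4];  /* byte stride per dimension (dest.) */
} dma_desc;

/* Software emulation of the transfer such an engine would offload:
 * walk the full index space and move one element per step. A real
 * engine would instead coalesce the innermost contiguous runs and
 * write-combine the scattered stores. */
static void dma_copy(const dma_desc *d)
{
    size_t idx[4] = {0};
    for (;;) {
        const char *s = d->src;
        char       *t = d->dst;
        for (int k = 0; k < d->ndims; k++) {
            s += idx[k] * d->src_stride[k];
            t += idx[k] * d->dst_stride[k];
        }
        memcpy(t, s, d->elem_size);
        /* Odometer-style increment over the index space. */
        int k = d->ndims - 1;
        while (k >= 0 && ++idx[k] == d->count[k]) {
            idx[k] = 0;
            k--;
        }
        if (k < 0)
            break;
    }
}
```

For example, transposing a row-major 3x4 `int` matrix into a 4x3 destination is expressed purely through the strides: source strides `{4*sizeof(int), sizeof(int)}`, destination strides `{sizeof(int), 3*sizeof(int)}`. The copy loop itself is oblivious to whether the request is a plain block copy or a transpose, which is what makes a single reconfigurable engine sufficient for both.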
Acknowledgments
We thank the anonymous reviewers for their helpful suggestions. We thank Robert Searles and Mathew Colgrove from NVIDIA for help troubleshooting and debugging CUDA issues on the Cori and Summit supercomputers. This work is supported by the Department of Energy under Grant No. DE-SC0021516. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility located at Lawrence Berkeley National Laboratory, operated under Contract No. DE-AC02-05CH11231. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Glines, M., Pirgov, P., Mullin, L., Khan, R. (2022). Strided DMA for Multidimensional Array Copy and Transpose. In: Arai, K. (eds) Intelligent Computing. SAI 2022. Lecture Notes in Networks and Systems, vol 506. Springer, Cham. https://doi.org/10.1007/978-3-031-10461-9_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-10460-2
Online ISBN: 978-3-031-10461-9
eBook Packages: Intelligent Technologies and Robotics (R0)