Abstract
Many applications require moving subsets of multidimensional arrays across the memory hierarchy of a computing system (MPI ranks, DRAM, GPU, etc.). While hardware supports efficient offload of contiguous data movement, non-contiguous data requires significantly more CPU orchestration. We run a series of multidimensional array copy and transpose microbenchmarks on two platforms, NERSC Perlmutter and ORNL Summit, and find that in some scenarios effective bandwidth degrades by up to 8-fold. We emulate a multidimensional array direct memory access (DMA) copy and transpose engine using a GPU kernel. This emulated DMA engine can more effectively prefetch and write-combine non-contiguous multidimensional array data, reducing latency and improving bandwidth. We propose a reconfigurable DMA engine that supports multiple strides and discuss how it can offload multidimensional array copy and transpose. Further, this DMA engine can use the stride information to better inform the policies of higher-level memory hierarchies to maximize bandwidth.
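To make the idea concrete, the following is a minimal software sketch (not the paper's actual implementation) of the kind of request a multi-stride DMA engine would accept: a hypothetical descriptor carrying per-dimension element counts and byte strides for source and destination, plus a host-side emulation of the copy it would offload. Swapping the destination strides turns the same copy into a transpose. All names (`dma_desc`, `dma_copy`) are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for one strided DMA request: up to four
 * dimensions, each with an element count and source/destination
 * byte strides. */
typedef struct {
    const char *src;
    char       *dst;
    size_t      elem_size;      /* bytes per element */
    int         ndims;
    size_t      count[4];       /* elements per dimension */
    size_t      src_stride[4];  /* byte stride per dimension (source) */
    size_t      dst_stride[4];  /* byte stride per dimension (dest.) */
} dma_desc;

/* Software emulation of the transfer such an engine would offload:
 * walk the full index space and move one element per step. A real
 * engine would instead coalesce the innermost contiguous runs and
 * write-combine the scattered stores. */
static void dma_copy(const dma_desc *d)
{
    size_t idx[4] = {0};
    for (;;) {
        const char *s = d->src;
        char       *t = d->dst;
        for (int k = 0; k < d->ndims; k++) {
            s += idx[k] * d->src_stride[k];
            t += idx[k] * d->dst_stride[k];
        }
        memcpy(t, s, d->elem_size);
        /* Odometer-style increment over the index space. */
        int k = d->ndims - 1;
        while (k >= 0 && ++idx[k] == d->count[k]) {
            idx[k] = 0;
            k--;
        }
        if (k < 0)
            break;
    }
}
```

For example, transposing a row-major 3x4 `int` matrix into a 4x3 destination is expressed purely through the strides: source strides `{4*sizeof(int), sizeof(int)}`, destination strides `{sizeof(int), 3*sizeof(int)}`. The copy loop itself is oblivious to whether the request is a plain block copy or a transpose, which is what makes a single reconfigurable engine sufficient for both.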
Acknowledgments
We thank the anonymous reviewers for their helpful suggestions. We thank Robert Searles and Mathew Colgrove from NVIDIA for help troubleshooting and debugging CUDA issues on the Cori and Summit supercomputers. This work is supported by the Department of Energy under Grant No. DE-SC0021516. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility located at Lawrence Berkeley National Laboratory, operated under Contract No. DE-AC02-05CH11231. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Glines, M., Pirgov, P., Mullin, L., Khan, R. (2022). Strided DMA for Multidimensional Array Copy and Transpose. In: Arai, K. (eds) Intelligent Computing. SAI 2022. Lecture Notes in Networks and Systems, vol 506. Springer, Cham. https://doi.org/10.1007/978-3-031-10461-9_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-10460-2
Online ISBN: 978-3-031-10461-9
eBook Packages: Intelligent Technologies and Robotics (R0)