Strided DMA for Multidimensional Array Copy and Transpose

Conference paper
Intelligent Computing (SAI 2022)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 506)

Abstract

Many applications require moving subsets of multidimensional arrays across the memory hierarchies of a computing system (MPI ranks, DRAM, GPU, etc.). While hardware supports efficient offload of contiguous data movement, non-contiguous data requires significantly more CPU orchestration. We run a series of multidimensional array copy and transpose microbenchmarks on two platforms, NERSC Perlmutter and ORNL Summit, and find that in some scenarios bandwidth is reduced by up to 8-fold. We emulate a multidimensional array direct memory access (DMA) copy and transpose engine using a GPU kernel. This DMA can more effectively prefetch and write-combine non-contiguous multidimensional array data, reducing latency and improving bandwidth. We propose a reconfigurable DMA engine that supports multiple strides and discuss how it can offload multidimensional array copy and transpose. Further, this DMA engine can use the stride information to better inform the policies of higher levels of the memory hierarchy to maximize bandwidth.
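To illustrate the kind of work such an engine would offload, the following CUDA sketch emulates a strided copy and transpose of a 2D sub-array with one GPU thread per element. This is not the authors' implementation; the kernel names, the row/column pitch parameters (row strides in elements), and the sizes in main are illustrative assumptions only.

// Minimal sketch (not the paper's code): emulating a strided DMA copy and
// transpose of a 2D sub-array. "pitch" is the row stride in elements, so
// source rows need not be contiguous in memory.
#include <cuda_runtime.h>

__global__ void strided_copy(const float* src, float* dst,
                             int rows, int cols,
                             int src_pitch, int dst_pitch) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;  // column within the sub-array
    int r = blockIdx.y * blockDim.y + threadIdx.y;  // row within the sub-array
    if (r < rows && c < cols)
        dst[r * dst_pitch + c] = src[r * src_pitch + c];  // strided gather, packed write
}

__global__ void strided_transpose(const float* src, float* dst,
                                  int rows, int cols,
                                  int src_pitch, int dst_pitch) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < cols)
        dst[c * dst_pitch + r] = src[r * src_pitch + c];  // swap row and column on the write
}

int main() {
    const int rows = 1024, cols = 1024;
    const int src_pitch = 4096;   // source rows spaced 4096 elements apart (non-contiguous)
    const int dst_pitch = rows;   // transposed destination stored packed
    float *src, *dst;
    cudaMalloc(&src, (size_t)rows * src_pitch * sizeof(float));
    cudaMalloc(&dst, (size_t)cols * dst_pitch * sizeof(float));

    dim3 block(32, 8);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    strided_transpose<<<grid, block>>>(src, dst, rows, cols, src_pitch, dst_pitch);
    cudaDeviceSynchronize();

    cudaFree(src);
    cudaFree(dst);
    return 0;
}

A tuned version would tile the transpose through shared memory (as in NVIDIA's transpose sample) so that both the strided reads and the transposed writes coalesce; the sketch only shows the addressing pattern that a multi-stride DMA descriptor would encode.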


Acknowledgments

We thank the anonymous reviewers for their helpful suggestions. We thank Robert Searles and Mathew Colgrove from NVIDIA for help troubleshooting and debugging CUDA issues on Cori and Summit supercomputers. This work is supported by the Department of Energy under Grant No. DE-SC0021516. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility located at Lawrence Berkeley National Laboratory, operated under Contract No. DE-AC02-05CH11231. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.

Author information

Corresponding author

Correspondence to Rishi Khan.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Glines, M., Pirgov, P., Mullin, L., Khan, R. (2022). Strided DMA for Multidimensional Array Copy and Transpose. In: Arai, K. (eds) Intelligent Computing. SAI 2022. Lecture Notes in Networks and Systems, vol 506. Springer, Cham. https://doi.org/10.1007/978-3-031-10461-9_26
