DOI: 10.1145/2807591.2807611
research-article

Memory access patterns: the missing piece of the multi-GPU puzzle

Published: 15 November 2015

Abstract

With the increased popularity of multi-GPU nodes in modern HPC clusters, it is imperative to develop matching programming paradigms for their efficient utilization. In order to take advantage of the local GPUs and the low-latency high-throughput interconnects that link them, programmers need to meticulously adapt parallel applications with respect to load balancing, boundary conditions and device synchronization. This paper presents MAPS-Multi, an automatic multi-GPU partitioning framework that distributes the workload based on the underlying memory access patterns. The framework consists of host- and device-level APIs that allow programs to efficiently run on a variety of GPU and multi-GPU architectures. The framework implements several layers of code optimization, device abstraction, and automatic inference of inter-GPU memory exchanges. The paper demonstrates that the performance of MAPS-Multi achieves near-linear scaling on fundamental computational operations, as well as real-world applications in deep learning and multivariate analysis.
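
To make the idea of partitioning by memory access pattern concrete, the following is a minimal Python sketch, not the MAPS-Multi API. It assumes a 1D row decomposition and a stencil-like pattern in which each output row reads `radius` neighboring rows on either side. The names `Segment`, `partition_rows`, and `halo_exchanges` are hypothetical and introduced only for illustration. From the access radius alone, the sketch derives both the per-device work split and the boundary rows that would have to be copied between devices before each step.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    device: int      # hypothetical device index
    own_start: int   # first row this device computes
    own_end: int     # one past the last row this device computes
    read_start: int  # first row this device must be able to read
    read_end: int    # one past the last row this device must be able to read


def partition_rows(num_rows: int, num_devices: int, radius: int) -> List[Segment]:
    """Split rows as evenly as possible; widen each read range by the radius."""
    if num_rows < 0 or num_devices <= 0 or radius < 0:
        raise ValueError("num_rows and radius must be >= 0, num_devices > 0")
    base, extra = divmod(num_rows, num_devices)
    segments = []
    start = 0
    for dev in range(num_devices):
        # The first `extra` devices take one additional row each.
        end = start + base + (1 if dev < extra else 0)
        segments.append(
            Segment(
                device=dev,
                own_start=start,
                own_end=end,
                read_start=max(0, start - radius),
                read_end=min(num_rows, end + radius),
            )
        )
        start = end
    return segments


def halo_exchanges(segments: List[Segment]) -> List[Tuple[int, int, int, int]]:
    """Return (src_device, dst_device, first_row, end_row) copies per step."""
    transfers = []
    for dst in segments:
        if dst.own_start == dst.own_end:
            continue  # a device with no work needs no input rows
        for src in segments:
            if src.device == dst.device:
                continue
            # Rows that dst reads but src owns must be copied from src to dst.
            lo = max(dst.read_start, src.own_start)
            hi = min(dst.read_end, src.own_end)
            if lo < hi:
                transfers.append((src.device, dst.device, lo, hi))
    return transfers


if __name__ == "__main__":
    segs = partition_rows(num_rows=10, num_devices=2, radius=1)
    for seg in segs:
        print(seg)
    for transfer in halo_exchanges(segs):
        print(transfer)
```

With a radius of zero (a purely element-wise pattern) the transfer list is empty, which is why the access pattern, rather than the data size alone, determines how much inter-device communication a partitioning requires.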

Published In

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015
985 pages
ISBN: 9781450337236
DOI: 10.1145/2807591
  • General Chair: Jackie Kern
  • Program Chair: Jeffrey S. Vetter
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. memory access patterns
  2. multi-GPU programming

Qualifiers

  • Research-article

Conference

SC15

Acceptance Rates

SC '15 Paper Acceptance Rate 79 of 358 submissions, 22%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 67
  • Downloads (last 6 weeks): 2
Reflects downloads up to 23 Jan 2025

Cited By

  • (2024) CUDASTF: Bridging the Gap Between CUDA and Task Parallelism. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-17. DOI: 10.1109/SC41406.2024.00049. Online publication date: 17-Nov-2024
  • (2023) SEECHIP: A Scalable and Energy-Efficient Chiplet-based GPU Architecture Using Photonic Links. Proceedings of the 52nd International Conference on Parallel Processing, pp. 566-575. DOI: 10.1145/3605573.3605626. Online publication date: 7-Aug-2023
  • (2022) Optimizing Aggregate Computation of Graph Neural Networks with on-GPU Interpreter-Style Programming. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pp. 83-95. DOI: 10.1145/3559009.3569690. Online publication date: 8-Oct-2022
  • (2022) Lightning: Scaling the GPU Programming Model Beyond a Single GPU. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 492-503. DOI: 10.1109/IPDPS53621.2022.00054. Online publication date: May-2022
  • (2022) Scalable and accurate multi-GPU-based image reconstruction of large-scale ptychography data. Scientific Reports, 12:1. DOI: 10.1038/s41598-022-09430-3. Online publication date: 29-Mar-2022
  • (2021) Topology-aware optimizations for multi-GPU ptychographic image reconstruction. Proceedings of the 35th ACM International Conference on Supercomputing, pp. 354-366. DOI: 10.1145/3447818.3460380. Online publication date: 3-Jun-2021
  • (2021) Comparing LLC-Memory Traffic between CPU and GPU Architectures. 2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures Workshop (RSDHA), pp. 8-16. DOI: 10.1109/RSDHA54838.2021.00007. Online publication date: Nov-2021
  • (2021) LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs. 2021 IEEE International Conference on Networking, Architecture and Storage (NAS), pp. 1-8. DOI: 10.1109/NAS51552.2021.9605411. Online publication date: Oct-2021
  • (2021) Silicon Photonic Flex-LIONS for Reconfigurable Multi-GPU Systems. Journal of Lightwave Technology, 39:4, pp. 1212-1220. DOI: 10.1109/JLT.2021.3052713. Online publication date: 15-Feb-2021
  • (2021) Efficient multi-GPU shared memory via automatic optimization of fine-grained transfers. Proceedings of the 48th Annual International Symposium on Computer Architecture, pp. 139-152. DOI: 10.1109/ISCA52012.2021.00020. Online publication date: 14-Jun-2021