ABSTRACT
Paged memory systems for GPUs, such as NVIDIA's Unified Virtual Memory, offer programmers a simple way to write out-of-core GPU programs. Storage-backed approaches can even handle working sets larger than host memory, as NVMe storage backs GPU memory through RDMA. However, paged memory systems can struggle with irregular access patterns. In this work, we analyze the limitations of paged, RDMA-backed GPU memory for out-of-core, irregular workloads through a case study of GNN training. We highlight the key limitations of these systems that must be overcome before the full potential of RDMA-backed GPU memory can be realized in a paged memory architecture.
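To make the programming model concrete, the sketch below is a minimal illustration (not code from the paper) using plain CUDA Unified Memory: `cudaMallocManaged` allocations may exceed device memory, and an irregular, GNN-style gather kernel faults non-resident pages in on demand. The allocation sizes and the gather kernel are assumptions chosen for illustration only.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Irregular gather: each thread reads a feature row selected by an index
// array, the access pattern typical of GNN neighbor aggregation. With
// managed memory, any row not resident on the GPU triggers a page fault
// and on-demand migration.
__global__ void gather(const float *features, const int *neighbors,
                       float *out, int num_samples, int feat_dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_samples) return;
    const float *row = features + (size_t)neighbors[i] * feat_dim;
    float acc = 0.0f;
    for (int d = 0; d < feat_dim; ++d)
        acc += row[d];
    out[i] = acc;
}

int main() {
    const size_t num_nodes = 1ull << 26;  // ~32 GB of features: sized to
    const int feat_dim = 128;             // oversubscribe GPU memory (assumption)
    const int num_samples = 1 << 20;

    float *features, *out;
    int *neighbors;
    // Managed allocations may exceed device memory; pages migrate on demand.
    // Features are left uninitialized here for brevity.
    cudaMallocManaged(&features, num_nodes * feat_dim * sizeof(float));
    cudaMallocManaged(&neighbors, num_samples * sizeof(int));
    cudaMallocManaged(&out, num_samples * sizeof(float));

    // Pseudo-random neighbor indices produce an irregular access pattern.
    for (int i = 0; i < num_samples; ++i)
        neighbors[i] = (int)((i * 2654435761u) % num_nodes);

    gather<<<(num_samples + 255) / 256, 256>>>(features, neighbors, out,
                                               num_samples, feat_dim);
    cudaDeviceSynchronize();
    printf("done: %f\n", out[0]);
    return 0;
}
```

Each warp in this kernel touches feature rows scattered across the managed allocation, so the page-fault handler sees little spatial locality; this is the kind of irregular pattern the case study examines.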