DOI: 10.1145/3447786.3456233

DGCL: an efficient communication library for distributed GNN training

Published: 21 April 2021

Abstract

Graph neural networks (GNNs) have gained increasing popularity in many areas such as e-commerce, social networks and bio-informatics. Distributed GNN training is essential for handling large graphs and reducing the execution time. However, for distributed GNN training, a peer-to-peer communication strategy suffers from high communication overheads. Also, different GPUs require different remote vertex embeddings, which leads to an irregular communication pattern and renders existing communication planning solutions unsuitable. We propose the distributed graph communication library (DGCL) for efficient GNN training on multiple GPUs. At the heart of DGCL is a communication planning algorithm tailored for GNN training, which jointly considers fully utilizing fast links, fusing communication, avoiding contention and balancing loads on different links. DGCL can be easily adopted to extend existing single-GPU GNN systems to distributed training. We conducted extensive experiments on different datasets and network configurations to compare DGCL with alternative communication schemes. In our experiments, DGCL reduces the communication time of peer-to-peer communication by 77.5% on average and the training time for an epoch by up to 47%.

Published In

EuroSys '21: Proceedings of the Sixteenth European Conference on Computer Systems
April 2021
631 pages
ISBN:9781450383349
DOI:10.1145/3447786

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. distributed and parallel training
  2. graph neural networks
  3. network communication

Qualifiers

  • Research-article

Conference

EuroSys '21: Sixteenth European Conference on Computer Systems
April 26 - 28, 2021
Online Event, United Kingdom

Acceptance Rates

EuroSys '21 paper acceptance rate: 38 of 181 submissions (21%).
Overall acceptance rate: 241 of 1,308 submissions (18%).

Article Metrics

  • Downloads (Last 12 months)181
  • Downloads (Last 6 weeks)24
Reflects downloads up to 05 Mar 2025


Cited By

  • (2025) Adaptive Parallel Training for Graph Neural Networks. Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 29-42. DOI: 10.1145/3710848.3710883. Online publication date: 28-Feb-2025.
  • (2025) From Sancus to Sancus^q: staleness and quantization-aware full-graph decentralized training in graph neural networks. The VLDB Journal 34, 2. DOI: 10.1007/s00778-024-00897-2. Online publication date: 31-Jan-2025.
  • (2025) Distributed Temporal Graph Neural Network Learning over Large-Scale Dynamic Graphs. Database Systems for Advanced Applications, 51-66. DOI: 10.1007/978-981-97-5779-4_4. Online publication date: 11-Jan-2025.
  • (2024) NeutronTP: Load-Balanced Distributed Full-Graph GNN Training with Tensor Parallelism. Proceedings of the VLDB Endowment 18, 2, 173-186. DOI: 10.14778/3705829.3705837. Online publication date: 1-Oct-2024.
  • (2024) Efficient Training of Graph Neural Networks on Large Graphs. Proceedings of the VLDB Endowment 17, 12, 4237-4240. DOI: 10.14778/3685800.3685844. Online publication date: 8-Nov-2024.
  • (2024) OUTRE: An OUT-of-Core De-REdundancy GNN Training Framework for Massive Graphs within A Single Machine. Proceedings of the VLDB Endowment 17, 11, 2960-2973. DOI: 10.14778/3681954.3681976. Online publication date: 30-Aug-2024.
  • (2024) TIGER: Training Inductive Graph Neural Network for Large-Scale Knowledge Graph Reasoning. Proceedings of the VLDB Endowment 17, 10, 2459-2472. DOI: 10.14778/3675034.3675039. Online publication date: 1-Jun-2024.
  • (2024) FreshGNN: Reducing Memory Access via Stable Historical Embeddings for Graph Neural Network Training. Proceedings of the VLDB Endowment 17, 6, 1473-1486. DOI: 10.14778/3648160.3648184. Online publication date: 3-May-2024.
  • (2024) Comprehensive Evaluation of GNN Training Systems: A Data Management Perspective. Proceedings of the VLDB Endowment 17, 6, 1241-1254. DOI: 10.14778/3648160.3648167. Online publication date: 3-May-2024.
  • (2024) A survey of graph convolutional networks (GCNs) in FPGA-based accelerators. Journal of Big Data 11, 1. DOI: 10.1186/s40537-024-01022-4. Online publication date: 11-Nov-2024.
