
NeutronStar: Distributed GNN Training with Hybrid Dependency Management

Published: 11 June 2022

ABSTRACT

GNN training must resolve vertex dependencies: each vertex's representation update depends on the representations of its neighbors. Existing distributed GNN systems adopt either a dependency-cached approach or a dependency-communicated approach. Through extensive experiments and analysis, we find that which approach performs best is determined by a set of factors, including the input graph, the model configuration, and the underlying computing cluster. If all GNN training workloads are supported by only one approach, performance is often suboptimal. We therefore study the relevant factors of each GNN training job before execution and choose the best-fit approach accordingly. We propose a hybrid dependency-handling approach that adaptively combines the merits of both approaches at runtime. Based on this hybrid approach, we develop a distributed GNN training system called NeutronStar, which delivers high-performance GNN training automatically. NeutronStar is further empowered by effective optimizations in CPU-GPU computation and data processing. Our experimental results on a 16-node Aliyun cluster demonstrate that NeutronStar achieves a 1.81X-14.25X speedup over existing GNN systems, including DistDGL and ROC.
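The hybrid decision described above can be illustrated with a toy cost model (entirely hypothetical; it does not reproduce the paper's actual decision logic): for each partition-boundary vertex, a dependency-cached plan pays a one-time cost to replicate remote neighbor data locally, while a dependency-communicated plan pays a per-layer cost to exchange intermediate embeddings over the network. A minimal sketch, with all parameters and the cost formulas assumed for illustration:

```python
def choose_strategy(n_remote: int, feat_dim: int, n_layers: int,
                    cache_bw: float, comm_bw: float) -> str:
    """Pick a dependency-handling strategy for one boundary vertex.

    Toy model: the cached plan replicates remote neighbor features once
    (n_remote * feat_dim words over cache_bw); the communicated plan
    exchanges embeddings at every layer (n_layers times the same volume
    over comm_bw). All parameters are illustrative, not from the paper.
    """
    cache_cost = n_remote * feat_dim / cache_bw
    comm_cost = n_layers * n_remote * feat_dim / comm_bw
    return "cached" if cache_cost <= comm_cost else "communicated"


# Under this toy model, a deep model on equal bandwidth favors caching,
# while a fast network with a shallow model favors communicating embeddings.
print(choose_strategy(n_remote=100, feat_dim=128, n_layers=4,
                      cache_bw=1e9, comm_bw=1e9))    # cached
print(choose_strategy(n_remote=100, feat_dim=128, n_layers=2,
                      cache_bw=1e9, comm_bw=1e10))   # communicated
```

The per-vertex granularity matters: on skewed real-world graphs, high-degree boundary vertices and low-degree ones can land on opposite sides of such a cost comparison, which is why a single global choice is often suboptimal.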

References

  1. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2--4, 2016. 265--283.
  2. Lars Backstrom, Daniel P. Huttenlocher, Jon M. Kleinberg, and Xiangyang Lan. 2006. Group formation in large social networks: membership, growth, and evolution. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20--23, 2006. 44--54.
  3. Zhenkun Cai, Xiao Yan, Yidi Wu, Kaihao Ma, James Cheng, and Fan Yu. 2021. DGCL: an efficient communication library for distributed GNN training. In EuroSys '21: Sixteenth European Conference on Computer Systems, Online Event, United Kingdom, April 26--28, 2021, Antonio Barbalace, Pramod Bhatotia, Lorenzo Alvisi, and Cristian Cadar (Eds.). ACM, 130--144.
  4. Rong Chen, Jiaxin Shi, Yanzhe Chen, and Haibo Chen. 2015. PowerLyra: differentiated graph computation and partitioning on skewed graphs. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys 2015, Bordeaux, France, April 21--24, 2015. 1:1--1:15.
  5. Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5--10, 2016, Barcelona, Spain, Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett (Eds.). 3837--3845.
  6. DGL 2020. Deep Graph Library: towards efficient and scalable deep learning on graphs. https://www.dgl.ai/.
  7. Euler 2019. Euler. https://github.com/alibaba/euler/wiki/System-Introduction.
  8. Wenfei Fan, Jingbo Xu, Yinghui Wu, Wenyuan Yu, Jiaxin Jiang, Zeyu Zheng, Bohan Zhang, Yang Cao, and Chao Tian. 2017. Parallelizing Sequential Graph Computations. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14--19, 2017. 495--510.
  9. Matthias Fey and Jan Eric Lenssen. 2019. Fast Graph Representation Learning with PyTorch Geometric. CoRR abs/1903.02428 (2019). arXiv:1903.02428 http://arxiv.org/abs/1903.02428
  10. Swapnil Gandhi and Anand Padmanabha Iyer. 2021. P3: Distributed Deep Graph Learning at Scale. In 15th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2021, July 14--16, 2021. 551--568.
  11. William L. Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4--9, 2017, Long Beach, CA, USA. 1024--1034.
  12. Lin Hu, Lei Zou, and Yu Liu. 2021. Accelerating Triangle Counting on GPU. In SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20--25, 2021. 736--748.
  13. Zhihao Jia, Yongkee Kwon, Galen M. Shipman, Patrick S. McCormick, Mattan Erez, and Alex Aiken. 2017. A Distributed Multi-GPU System for Fast Graph Processing. Proc. VLDB Endow. 11, 3 (2017), 297--310. https://doi.org/10.14778/3157794.3157799
  14. Zhihao Jia, Sina Lin, Mingyu Gao, Matei Zaharia, and Alex Aiken. 2020. Improving the Accuracy, Scalability, and Performance of Graph Neural Networks with Roc. In Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2--4, 2020.
  15. George Karypis and Vipin Kumar. 1996. Parallel Multilevel Graph Partitioning. In Proceedings of IPPS '96, The 10th International Parallel Processing Symposium, April 15--19, 1996, Honolulu, Hawaii, USA. 314--319.
  16. George Karypis and Vipin Kumar. 1998. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM J. Sci. Comput. 20, 1 (1998), 359--392. https://doi.org/10.1137/S1064827595287997
  17. Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24--26, 2017, Conference Track Proceedings.
  18. Jérôme Kunegis. 2013. KONECT: the Koblenz network collection. In 22nd International World Wide Web Conference, WWW '13, Rio de Janeiro, Brazil, May 13--17, 2013, Companion Volume. 1343--1350.
  19. Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue B. Moon. 2010. What is Twitter, a social network or a news media?. In Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26--30, 2010. 591--600.
  20. Chen Lei. 2021. Deep Learning and Practice with MindSpore. Springer. https://doi.org/10.1007/978-981-16-2233-5
  21. Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. 2009. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Math. 6, 1 (2009), 29--123.
  22. Houyi Li, Yongchao Liu, Yongyong Li, Bin Huang, Peng Zhang, Guowei Zhang, Xintan Zeng, Kefeng Deng, Wenguang Chen, and Changhua He. 2021. GraphTheta: A Distributed Graph Neural Network Learning System With Flexible Training Strategy. CoRR abs/2104.10569 (2021).
  23. Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. 2020. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. Proc. VLDB Endow. 13, 12 (2020), 3005--3018.
  24. Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. 2016. Gated Graph Sequence Neural Networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2--4, 2016, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.).
  25. Husong Liu, Shengliang Lu, Xinyu Chen, and Bingsheng He. 2020. G3: When Graph Neural Networks Meet Parallel Graph Processing Systems on GPUs. Proc. VLDB Endow. 13, 12 (2020), 2813--2816.
  26. Qi Liu, Maximilian Nickel, and Douwe Kiela. 2019. Hyperbolic Graph Neural Networks. CoRR abs/1910.12892 (2019).
  27. Lingxiao Ma, Zhi Yang, Youshan Miao, Jilong Xue, Ming Wu, Lidong Zhou, and Yafei Dai. 2019. NeuGraph: Parallel Deep Neural Network Computation on Large Graphs. In 2019 USENIX Annual Technical Conference, USENIX ATC 2019, Renton, WA, USA, July 10--12, 2019. 443--458.
  28. Diego Marcheggiani and Ivan Titov. 2017. Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9--11, 2017, Martha Palmer, Rebecca Hwa, and Sebastian Riedel (Eds.). Association for Computational Linguistics, 1506--1515.
  29. Vasimuddin Md, Sanchit Misra, Guixiang Ma, Ramanarayan Mohanty, Evangelos Georganas, Alexander Heinecke, Dhiraj D. Kalamkar, Nesreen K. Ahmed, and Sasikanth Avancha. 2021. DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks. CoRR abs/2104.06700 (2021).
  30. Seungwon Min, Kun Wu, Sitao Huang, Mert Hidayetoglu, Jinjun Xiong, Eiman Ebrahimi, Deming Chen, and Wen-mei W. Hwu. 2021. Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture. Proc. VLDB Endow. 14, 11 (2021), 2087--2100.
  31. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8--14 December 2019, Vancouver, BC, Canada. 8024--8035.
  32. PyTorch 2020. Tensors and Dynamic neural networks in Python with strong GPU acceleration. https://pytorch.org/.
  33. Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. 2008. Collective Classification in Network Data. AI Mag. 29, 3 (2008), 93--106.
  34. George M. Slota, Sivasankaran Rajamanickam, Karen D. Devine, and Kamesh Madduri. 2017. Partitioning Trillion-Edge Graphs in Minutes. In 2017 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2017, Orlando, FL, USA, May 29 - June 2, 2017. 646--655.
  35. Lubos Takac and Michal Zabovsky. 2012. Data Analysis in Public Social Networks. International Scientific Conference & International Workshop Present Day Trends of Innovations, May 28--29 (2012).
  36. John Thorpe, Yifan Qiao, Jonathan Eyolfson, Shen Teng, Guanzhou Hu, Zhihao Jia, Jinliang Wei, Keval Vora, Ravi Netravali, Miryung Kim, and Guoqing Harry Xu. 2021. Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads. In 15th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2021, July 14--16, 2021. 495--514.
  37. Alok Tripathy, Katherine A. Yelick, and Aydin Buluç. 2020. Reducing communication in graph neural network training. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9--19, 2020. 70.
  38. Alok Tripathy, Katherine A. Yelick, and Aydin Buluç. 2020. Reducing Communication in Graph Neural Network Training. CoRR abs/2005.03300 (2020).
  39. Charalampos E. Tsourakakis, Christos Gkantsidis, Bozidar Radunovic, and Milan Vojnovic. 2014. FENNEL: streaming graph partitioning for massive scale graphs. In Seventh ACM International Conference on Web Search and Data Mining, WSDM 2014, New York, NY, USA, February 24--28, 2014. 333--342.
  40. Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
  41. Hao Wang, Liang Geng, Rubao Lee, Kaixi Hou, Yanfeng Zhang, and Xiaodong Zhang. 2019. SEP-graph: finding shortest execution paths for graph processing under a hybrid framework on GPU. In Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2019, Washington, DC, USA, February 16--20, 2019. 38--52.
  42. Lei Wang, Qiang Yin, Chao Tian, Jianbang Yang, Rong Chen, Wenyuan Yu, Zihang Yao, and Jingren Zhou. 2021. FlexGraph: a flexible and efficient distributed framework for GNN training. In EuroSys '21: Sixteenth European Conference on Computer Systems, Online Event, United Kingdom, April 26--28, 2021, Antonio Barbalace, Pramod Bhatotia, Lorenzo Alvisi, and Cristian Cadar (Eds.). ACM, 67--82.
  43. Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, Ziyue Huang, Qipeng Guo, Hao Zhang, Haibin Lin, Junbo Zhao, Jinyang Li, Alexander J. Smola, and Zheng Zhang. 2019. Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs. CoRR abs/1909.01315 (2019). arXiv:1909.01315 http://arxiv.org/abs/1909.01315
  44. Qiange Wang, Yanfeng Zhang, Hao Wang, Liang Geng, Rubao Lee, Xiaodong Zhang, and Ge Yu. 2020. Automating Incremental and Asynchronous Evaluation for Recursive Aggregate Data Processing. In Proceedings of the 2020 International Conference on Management of Data. 2439--2454.
  45. Yuke Wang, Boyuan Feng, Gushu Li, Shuangchen Li, Lei Deng, Yuan Xie, and Yufei Ding. 2021. GNNAdvisor: An Adaptive and Efficient Runtime System for GNN Acceleration on GPUs. In 15th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2021, July 14--16, 2021. 515--531.
  46. Shiwen Wu, Wentao Zhang, Fei Sun, and Bin Cui. 2020. Graph Neural Networks in Recommender Systems: A Survey. CoRR abs/2011.02260 (2020).
  47. Yidi Wu, Kaihao Ma, Zhenkun Cai, Tatiana Jin, Boyang Li, Chengguang Zheng, James Cheng, and Fan Yu. 2021. Seastar: vertex-centric programming for graph neural networks. In EuroSys '21: Sixteenth European Conference on Computer Systems, Online Event, United Kingdom, April 26--28, 2021, Antonio Barbalace, Pramod Bhatotia, Lorenzo Alvisi, and Cristian Cadar (Eds.). ACM, 359--375.
  48. Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. 2019. A Comprehensive Survey on Graph Neural Networks. CoRR abs/1901.00596 (2019). arXiv:1901.00596 http://arxiv.org/abs/1901.00596
  49. Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks?. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6--9, 2019. https://openreview.net/forum?id=ryGs6iA5Km
  50. Jaewon Yang and Jure Leskovec. 2015. Defining and evaluating network communities based on ground-truth. Knowl. Inf. Syst. 42, 1 (2015), 181--213.
  51. Chenzi Zhang, Fan Wei, Qin Liu, Zhihao Gavin Tang, and Zhenguo Li. 2017. Graph Edge Partitioning via Neighborhood Heuristic. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13--17, 2017. 605--614.
  52. Dalong Zhang, Xin Huang, Ziqi Liu, Jun Zhou, Zhiyang Hu, Xianzheng Song, Zhibang Ge, Lin Wang, Zhiqiang Zhang, and Yuan Qi. 2020. AGL: A Scalable System for Industrial-purpose Graph Machine Learning. Proc. VLDB Endow. 13, 12 (2020), 3125--3137.
  53. Da Zheng, Chao Ma, Minjie Wang, Jinjing Zhou, Qidong Su, Xiang Song, Quan Gan, Zheng Zhang, and George Karypis. 2020. DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs. CoRR abs/2010.05337 (2020).
  54. Rong Zhu, Kun Zhao, Hongxia Yang, Wei Lin, Chang Zhou, Baole Ai, Yong Li, and Jingren Zhou. 2019. AliGraph: A Comprehensive Graph Neural Network Platform. PVLDB 12, 12 (2019), 2094--2105. https://doi.org/10.14778/3352063.3352127
  55. Xiaowei Zhu, Wenguang Chen, Weimin Zheng, and Xiaosong Ma. 2016. Gemini: A Computation-Centric Distributed Graph Processing System. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2--4, 2016. 301--316.

Published in

SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data
June 2022
2597 pages
ISBN: 9781450392495
DOI: 10.1145/3514221

        Copyright © 2022 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Qualifiers

        • research-article

        Acceptance Rates

Overall acceptance rate: 785 of 4,003 submissions, 20%
