DOI: 10.1145/3575693.3575724

MSCCLang: Microsoft Collective Communication Language

Published: 30 January 2023

Abstract

Machine learning models with millions or billions of parameters are increasingly trained and served on large multi-GPU systems. As models grow in size and execute on more GPUs, collective communication becomes a bottleneck. Custom collective algorithms optimized for both particular network topologies and application-specific communication patterns can alleviate this bottleneck and help these applications scale. However, implementing correct and efficient custom algorithms is challenging.
This paper introduces MSCCLang, a system for programmable GPU communication. MSCCLang provides a domain-specific language for writing collective communication algorithms and an optimizing compiler that lowers them to an executable form, which an interpreter-based runtime executes efficiently and flexibly. We used MSCCLang to write novel collective algorithms for AllReduce and AllToAll that are up to 1.9× and 1.3× faster than hand-optimized implementations, respectively.
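
As a point of reference for the kind of algorithm such a DSL expresses, the sketch below simulates the chunk-level schedule of a classic ring AllReduce (a reduce-scatter phase followed by an all-gather phase) in plain Python. It is an illustrative sketch only, not the MSCCLang DSL, compiler, or runtime; the function name ring_allreduce and the list-of-lists buffer model are assumptions made here for exposition.

# Plain-Python simulation of a ring AllReduce schedule: reduce-scatter, then
# all-gather. Models data movement between ranks only; not the MSCCLang API.
def ring_allreduce(buffers):
    """buffers[r][c] is chunk c on rank r; every rank ends with the full sum."""
    n = len(buffers)                      # number of ranks == number of chunks
    # Reduce-scatter: after n-1 steps, rank r holds fully reduced chunk (r+1) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n            # chunk index rank r sends this step
            nxt = (r + 1) % n             # ring neighbor
            buffers[nxt][c] += buffers[r][c]
    # All-gather: circulate each fully reduced chunk once around the ring.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n        # chunk index rank r forwards this step
            nxt = (r + 1) % n
            buffers[nxt][c] = buffers[r][c]
    return buffers

if __name__ == "__main__":
    ranks = 4
    bufs = [[float(r)] * ranks for r in range(ranks)]   # each chunk starts as the rank id
    out = ring_allreduce(bufs)
    assert all(v == sum(range(ranks)) for b in out for v in b)
    print(out)                            # every chunk on every rank equals 6.0

In MSCCLang, a schedule of this kind is written once against a topology and then compiled for the interpreter-based runtime; the simulation above only illustrates the per-step chunk routing that such a program specifies.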



Published In

ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2
January 2023
947 pages
ISBN:9781450399166
DOI:10.1145/3575693


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Collective Communication
  2. Compilers
  3. GPU

Qualifiers

  • Research-article

Conference

ASPLOS '23

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%


Cited By

  • (2025) Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning. Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 198-213. https://doi.org/10.1145/3669940.3707223. Online publication date: 3-Feb-2025.
  • (2024) Near-Lossless Gradient Compression for Data-Parallel Distributed DNN Training. Proceedings of the 2024 ACM Symposium on Cloud Computing, 977-994. https://doi.org/10.1145/3698038.3698541. Online publication date: 20-Nov-2024.
  • (2024) TCCL: Co-optimizing Collective Communication and Traffic Routing for GPU-centric Clusters. Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing, 48-53. https://doi.org/10.1145/3672198.3673799. Online publication date: 4-Aug-2024.
  • (2024) Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem. Proceedings of the ACM SIGCOMM 2024 Conference, 16-37. https://doi.org/10.1145/3651890.3672249. Online publication date: 4-Aug-2024.
  • (2024) RDMA over Ethernet for Distributed Training at Meta Scale. Proceedings of the ACM SIGCOMM 2024 Conference, 57-70. https://doi.org/10.1145/3651890.3672233. Online publication date: 4-Aug-2024.
  • (2024) Logical Synchrony and the Bittide Mechanism. IEEE Transactions on Parallel and Distributed Systems, 35(11), 1936-1948. https://doi.org/10.1109/TPDS.2024.3444739. Online publication date: Nov-2024.
  • (2024) TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 856-870. https://doi.org/10.1109/MICRO61859.2024.00068. Online publication date: 2-Nov-2024.
  • (2024) PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 245-260. https://doi.org/10.1109/ISCA59077.2024.00027. Online publication date: 29-Jun-2024.
  • (2024) Towards a Standardized Representation for Deep Learning Collective Algorithms. 2024 IEEE Symposium on High-Performance Interconnects (HOTI), 33-36. https://doi.org/10.1109/HOTI63208.2024.00017. Online publication date: 21-Aug-2024.
  • (2023) Tale of Two Cs: Computation vs. Communication Scaling for Future Transformers on Future Hardware. 2023 IEEE International Symposium on Workload Characterization (IISWC), 140-153. https://doi.org/10.1109/IISWC59245.2023.00026. Online publication date: 1-Oct-2023.
