Abstract
Gradient sparsification is widely adopted in distributed training; however, it suffers from a trade-off between computation and communication. The prevalent Top-k sparsifier achieves the desired gradient compression ratio but incurs substantial computational overhead, whereas the hard-threshold sparsifier eliminates this overhead but fails to achieve the targeted compression ratio. Motivated by this trade-off, we design a novel threshold-based sparsifier called SAGE, which achieves a compression ratio close to that of the Top-k sparsifier with negligible computational overhead. SAGE scales the compression ratio by deriving an adjustable threshold from heuristics computed at each iteration. Experimental results show that SAGE achieves a compression ratio closer to the desired ratio than the hard-threshold sparsifier without degrading the accuracy of model training. In terms of computation time for gradient selection, SAGE achieves a speedup of up to \(23.62\times\) over the Top-k sparsifier.
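To illustrate the idea of adjusting a sparsification threshold toward a target compression ratio, the sketch below estimates the threshold from a random sample of gradient magnitudes. This is a minimal, hypothetical illustration: the sampling rule, `estimate_threshold`, and all parameters are stand-ins, not SAGE's actual heuristic derivation.

```python
import numpy as np

def estimate_threshold(grad, target_density, sample_size=10_000, seed=0):
    """Estimate a magnitude threshold from a random sample so that roughly
    a target_density fraction of components exceeds it. This sampling rule
    is an illustrative stand-in, not SAGE's actual derivation."""
    rng = np.random.default_rng(seed)
    sample = np.abs(rng.choice(grad, size=sample_size, replace=False))
    return np.quantile(sample, 1.0 - target_density)

def threshold_sparsify(grad, threshold):
    """Keep only components whose magnitude exceeds the threshold."""
    mask = np.abs(grad) > threshold
    return grad * mask, mask

rng = np.random.default_rng(42)
grad = rng.standard_normal(1_000_000)  # stand-in for a flattened gradient
target_density = 0.001                 # i.e., a 1000x compression ratio

threshold = estimate_threshold(grad, target_density)
sparse_grad, mask = threshold_sparsify(grad, threshold)
achieved_density = mask.mean()         # close to the target, unlike a fixed threshold
```

Unlike a fixed hard threshold, recomputing the threshold from each iteration's gradient keeps the achieved density near the target even as the gradient distribution shifts during training.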
Data availability statement
The data presented in this study are publicly available at https://github.com/kljp/sage/.
Notes
Several articles use the term compression ratio, which is the inverse of the density: for example, the compression ratio is 1000\(\times\) if the density is 0.001. For broader accessibility, we use the term compression ratio in our paper title.
SAGE is short for sparsity-adjustable gradient exchange.
We denote the L2 norm as \({\Vert }{\cdot }{\Vert }\) and use its squared value to evaluate the error, as in a prior study [23].
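The squared-norm error metric from this note can be computed directly; the snippet below is an illustrative example with made-up values, where `compression_error` is a hypothetical helper name.

```python
import numpy as np

def compression_error(grad, sparse_grad):
    """Squared L2 norm of the residual, i.e. ||g - sparsify(g)||^2."""
    return float(np.sum((grad - sparse_grad) ** 2))

g = np.array([0.5, -2.0, 0.1, 3.0])
g_sparse = np.where(np.abs(g) > 1.0, g, 0.0)  # keeps -2.0 and 3.0
err = compression_error(g, g_sparse)          # 0.5^2 + 0.1^2 = 0.26
```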
References
Floridi Luciano, Chiriatti Massimo (2020) Gpt-3: its nature, scope, limits, and consequences. Minds Mach 30(4):681–694
Shoeybi Mohammad, Patwary Mostofa, Puri Raul, LeGresley Patrick, Casper Jared, Catanzaro Bryan (2019) Megatron-lm: training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053
Rosset Corby (2020) Turing-nlg: A 17-billion-parameter language model by microsoft. Microsoft Blog
Lin Yujun, Han Song, Mao Huizi, Wang Yu, Dally William J (2017) Deep gradient compression: reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887
Shi Shaohuai, Wang Qiang, Zhao Kaiyong, Tang Zhenheng, Wang Yuxin, Huang Xiang, Chu Xiaowen (2019) A distributed synchronous sgd algorithm with global top-k sparsification for low bandwidth networks. In: 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), IEEE, pp 2238–2247
Chen Chia-Yu, Choi Jungwook, Brand Daniel, Agrawal Ankur, Zhang Wei, Gopalakrishnan Kailash (2018) Adacomp: Adaptive residual gradient compression for data-parallel distributed training. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 32
Shi Shaohuai, Chu Xiaowen, Cheung Ka Chun, See Simon (2019) Understanding top-k sparsification in distributed deep learning. arXiv preprint arXiv:1911.08772
Shi Shaohuai, Tang Zhenheng, Wang Qiang, Zhao Kaiyong, Chu Xiaowen (2019) Layer-wise adaptive gradient sparsification for distributed deep learning with convergence guarantees. arXiv preprint arXiv:1911.08727
Shi Shaohuai, Wang Qiang, Chu Xiaowen, Li Bo, Qin Yang, Liu Ruihao, Zhao Xinxiao (2020) Communication-efficient distributed deep learning with merged gradient sparsification on gpus. In: IEEE INFOCOM 2020-IEEE Conference on Computer Communications. IEEE, pp 406–415
Chen Chia-Yu, Ni Jiamin, Songtao Lu, Cui Xiaodong, Chen Pin-Yu, Sun Xiao, Wang Naigang, Venkataramani Swagath, Srinivasan Vijayalakshmi, Zhang Wei, Gopalakrishnan Kailash (2020) Scalecom: scalable sparsified gradient compression for communication-efficient distributed training. Adv Neural Inf Process Syst 33:13551–13563
Alistarh Dan, Grubic Demjan, Li Jerry, Tomioka Ryota, Vojnovic Milan (2017) Qsgd: communication-efficient sgd via gradient quantization and encoding. Adv Neural Inf Process Syst 30
Yuqing Du, Yang Sheng, Huang Kaibin (2020) High-dimensional stochastic gradient quantization for communication-efficient edge learning. IEEE Trans Signal Process 68:2128–2142
Mordido Gonçalo, Van Keirsbilck Matthijs, Keller Alexander (2020) Monte carlo gradient quantization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp 718–719
Faghri Fartash, Tabrizian Iman, Markov Ilia, Alistarh Dan, Roy Daniel M, Ramezani-Kebrya Ali (2020) Adaptive gradient quantization for data-parallel sgd. Adv Neural Inf Process Syst 33:3174–3185
Tang Hanlin, Li Yao, Liu Ji, Yan Ming (2021) Errorcompensatedx: error compensation for variance reduced algorithms. Adv Neural Inf Process Syst 34:18102–18113
Abrahamyan Lusine, Chen Yiming, Bekoulis Giannis, Deligiannis Nikos (2021) Learned gradient compression for distributed deep learning. IEEE Trans Neural Netw Learn Syst 33(12):7330–7344
Aji Alham Fikri, Heafield Kenneth (2017) Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021
Alistarh Dan, Hoefler Torsten, Johansson Mikael, Konstantinov Nikola, Khirirat Sarit, Renggli Cédric (2018) The convergence of sparsified gradient methods. Adv Neural Inf Process Syst 31
Shi Shaohuai, Zhao Kaiyong, Wang Qiang, Tang Zhenheng, Chu Xiaowen (2019) A convergence analysis of distributed sgd with communication-efficient gradient sparsification. In: IJCAI, pp 3411–3417
Shanbhag Anil, Pirk Holger, Madden Samuel (2018) Efficient top-k query processing on massively parallel hardware. In: Proceedings of the 2018 International Conference on Management of Data, pp 1557–1570
Gaihre Anil, Zheng Da, Weitze Scott, Li Lingda, Song Shuaiwen Leon, Ding Caiwen, Li Xiaoye S, Liu Hang (2021) Dr. top-k: delegate-centric top-k on gpus. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp 1–14
Abdelmoniem Ahmed M, Elzanaty Ahmed, Alouini Mohamed-Slim, Canini Marco (2021) An efficient statistical-based gradient compression technique for distributed training systems. Proc Mach Learn Syst 3:297–322
Sahu Atal, Dutta Aritra, Abdelmoniem Ahmed M, Banerjee Trambak, Canini Marco, Kalnis Panos (2021) Rethinking gradient sparsification as total error minimization. Adv Neural Inf Process Syst 34:8133–8146
Luebke David (2008) Cuda: Scalable parallel programming for high-performance scientific computing. In: 2008 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro. IEEE, pp 836–838
Sidco compressor source code (2021) https://github.com/sands-lab/SIDCo/blob/main/compression.py/
Seide Frank, Fu Hao, Droppo Jasha, Li Gang, Yu Dong (2014) 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In: Fifteenth Annual Conference of the International Speech Communication Association
Yamane Taro (1967) Statistics: an introductory analysis. Harper & Row
Israel Glenn D (1992) Determining sample size. Fact Sheet PEOD-6, University of Florida
He Kaiming, Zhang Xiangyu, Ren Shaoqing, Sun Jian (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
Krizhevsky Alex (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto
Hochreiter Sepp, Schmidhuber Jürgen (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Merity Stephen, Xiong Caiming, Bradbury James, Socher Richard (2016) Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843
He Xiangnan, Liao Lizi, Zhang Hanwang, Nie Liqiang, Hu Xia, Chua Tat-Seng (2017) Neural collaborative filtering. In: Proceedings of the 26th International Conference on World Wide Web, pp 173–182
Movielens 20m dataset (2015) https://grouplens.org/datasets/movielens/20m/
Paszke Adam, Gross Sam, Massa Francisco, Lerer Adam, Bradbury James, Chanan Gregory, Killeen Trevor, Lin Zeming, Gimelshein Natalia, Antiga Luca, Desmaison Alban, Kopf Andreas, Yang Edward, DeVito Zachary, Raison Martin, Tejani Alykhan, Chilamkurthy Sasank, Steiner Benoit, Fang Lu, Bai Junjie, Chintala Soumith (2019) Pytorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32:8024–8035
Hard-threshold sparsifier source code (2021) https://github.com/sands-lab/rethinking-sparsification/
Neural collaborative filtering source code (2018) https://github.com/yihong-chen/neural-collaborative-filtering/
Acknowledgements
This work was jointly supported by the BK21 FOUR program (NRF5199991014091), the Basic Science Research Program (2022R1F1A1062779) of National Research Foundation (NRF) of Korea, the Korea Institute of Science and Technology Information (KISTI) (TS-2022-RE-0019), and (KSC-2022-CRE-0406).
Author information
Authors and Affiliations
Contributions
DY and SO wrote the main manuscript text. DY and MJ carried out the experiment. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
Daegun Yoon, Minjoong Jeong, and Sangyoon Oh declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yoon, D., Jeong, M. & Oh, S. SAGE: toward on-the-fly gradient compression ratio scaling. J Supercomput 79, 11387–11409 (2023). https://doi.org/10.1007/s11227-023-05120-7