Abstract
Gradient sparsification is widely adopted in distributed training; however, it suffers from a trade-off between computation and communication. The prevalent Top-k sparsifier achieves the desired gradient compression ratio but incurs substantial computational overhead, whereas the hard-threshold sparsifier eliminates this overhead but fails to achieve the targeted compression ratio. Motivated by this trade-off, we design a novel threshold-based sparsifier called SAGE, which achieves a compression ratio close to that of the Top-k sparsifier with negligible computational overhead. SAGE scales the compression ratio by deriving an adjustable threshold from heuristics computed at each iteration. Experimental results show that SAGE achieves a compression ratio closer to the desired ratio than the hard-threshold sparsifier without degrading the accuracy of model training. In terms of computation time for gradient selection, SAGE achieves a speedup of up to \(23.62\times\) over the Top-k sparsifier.
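To illustrate the idea of adjusting a sparsification threshold toward a target compression ratio, the sketch below estimates the threshold from a random sample of gradient magnitudes. This is a minimal, hypothetical illustration: the sampling rule, `estimate_threshold`, and all parameters are stand-ins, not SAGE's actual heuristic derivation.

```python
import numpy as np

def estimate_threshold(grad, target_density, sample_size=10_000, seed=0):
    """Estimate a magnitude threshold from a random sample so that roughly
    a target_density fraction of components exceeds it. This sampling rule
    is an illustrative stand-in, not SAGE's actual derivation."""
    rng = np.random.default_rng(seed)
    sample = np.abs(rng.choice(grad, size=sample_size, replace=False))
    return np.quantile(sample, 1.0 - target_density)

def threshold_sparsify(grad, threshold):
    """Keep only components whose magnitude exceeds the threshold."""
    mask = np.abs(grad) > threshold
    return grad * mask, mask

rng = np.random.default_rng(42)
grad = rng.standard_normal(1_000_000)  # stand-in for a flattened gradient
target_density = 0.001                 # i.e., a 1000x compression ratio

threshold = estimate_threshold(grad, target_density)
sparse_grad, mask = threshold_sparsify(grad, threshold)
achieved_density = mask.mean()         # close to the target, unlike a fixed threshold
```

Unlike a fixed hard threshold, recomputing the threshold from each iteration's gradient keeps the achieved density near the target even as the gradient distribution shifts during training.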
Data availability statement
The data presented in this study are publicly available at https://github.com/kljp/sage/.
Notes
Several articles use the term compression ratio, which is the inverse of the density: for example, the compression ratio is 1000\(\times\) if the density is 0.001. For broader accessibility, we use the term compression ratio in our paper title.
SAGE is short for sparsity-adjustable gradient exchange.
We denote the L2 norm as \({\Vert }{\cdot }{\Vert }\) and use its squared value to evaluate the error, as in a prior study [23].
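The squared-norm error metric from this note can be computed directly; the snippet below is an illustrative example with made-up values, where `compression_error` is a hypothetical helper name.

```python
import numpy as np

def compression_error(grad, sparse_grad):
    """Squared L2 norm of the residual, i.e. ||g - sparsify(g)||^2."""
    return float(np.sum((grad - sparse_grad) ** 2))

g = np.array([0.5, -2.0, 0.1, 3.0])
g_sparse = np.where(np.abs(g) > 1.0, g, 0.0)  # keeps -2.0 and 3.0
err = compression_error(g, g_sparse)          # 0.5^2 + 0.1^2 = 0.26
```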
References
Floridi Luciano, Chiriatti Massimo (2020) Gpt-3: its nature, scope, limits, and consequences. Minds Mach 30(4):681–694
Shoeybi Mohammad, Patwary Mostofa, Puri Raul, LeGresley Patrick, Casper Jared, Catanzaro Bryan (2019) Megatron-lm: training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053
Rosset Corby (2020) Turing-nlg: A 17-billion-parameter language model by microsoft. Microsoft Blog
Lin Yujun, Han Song, Mao Huizi, Wang Yu, Dally William J (2017) Deep gradient compression: reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887
Shi Shaohuai, Wang Qiang, Zhao Kaiyong, Tang Zhenheng, Wang Yuxin, Huang Xiang, Chu Xiaowen (2019) A distributed synchronous sgd algorithm with global top-k sparsification for low bandwidth networks. In: 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), IEEE, pp 2238–2247
Chen Chia-Yu, Choi Jungwook, Brand Daniel, Agrawal Ankur, Zhang Wei, Gopalakrishnan Kailash (2018) Adacomp: Adaptive residual gradient compression for data-parallel distributed training. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 32
Shi Shaohuai, Chu Xiaowen, Cheung Ka Chun, See Simon (2019) Understanding top-k sparsification in distributed deep learning. arXiv preprint arXiv:1911.08772
Shi Shaohuai, Tang Zhenheng, Wang Qiang, Zhao Kaiyong, Chu Xiaowen (2019) Layer-wise adaptive gradient sparsification for distributed deep learning with convergence guarantees. arXiv preprint arXiv:1911.08727
Shi Shaohuai, Wang Qiang, Chu Xiaowen, Li Bo, Qin Yang, Liu Ruihao, Zhao Xinxiao (2020) Communication-efficient distributed deep learning with merged gradient sparsification on gpus. In: IEEE INFOCOM 2020-IEEE Conference on Computer Communications. IEEE, pp 406–415
Chen Chia-Yu, Ni Jiamin, Songtao Lu, Cui Xiaodong, Chen Pin-Yu, Sun Xiao, Wang Naigang, Venkataramani Swagath, Srinivasan Vijayalakshmi, Zhang Wei, Gopalakrishnan Kailash (2020) Scalecom: scalable sparsified gradient compression for communication-efficient distributed training. Adv Neural Inf Process Syst 33:13551–13563
Alistarh Dan, Grubic Demjan, Li Jerry, Tomioka Ryota, Vojnovic Milan (2017) Qsgd: communication-efficient sgd via gradient quantization and encoding. Adv Neural Inf Process Syst 30
Yuqing Du, Yang Sheng, Huang Kaibin (2020) High-dimensional stochastic gradient quantization for communication-efficient edge learning. IEEE Trans Signal Process 68:2128–2142
Mordido Gonçalo, Van Keirsbilck Matthijs, Keller Alexander (2020) Monte carlo gradient quantization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp 718–719
Faghri Fartash, Tabrizian Iman, Markov Ilia, Alistarh Dan, Roy Daniel M, Ramezani-Kebrya Ali (2020) Adaptive gradient quantization for data-parallel sgd. Adv Neural Inf Process Syst 33:3174–3185
Tang Hanlin, Li Yao, Liu Ji, Yan Ming (2021) Errorcompensatedx: error compensation for variance reduced algorithms. Adv Neural Inf Process Syst 34:18102–18113
Abrahamyan Lusine, Chen Yiming, Bekoulis Giannis, Deligiannis Nikos (2021) Learned gradient compression for distributed deep learning. IEEE Trans Neural Netw Learn Syst 33(12):7330–7344
Aji Alham Fikri, Heafield Kenneth (2017) Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021
Alistarh Dan, Hoefler Torsten, Johansson Mikael, Konstantinov Nikola, Khirirat Sarit, Renggli Cédric (2018) The convergence of sparsified gradient methods. Adv Neural Inf Process Syst 31
Shi Shaohuai, Zhao Kaiyong, Wang Qiang, Tang Zhenheng, Chu Xiaowen (2019) A convergence analysis of distributed sgd with communication-efficient gradient sparsification. In: IJCAI, pp 3411–3417
Shanbhag Anil, Pirk Holger, Madden Samuel (2018) Efficient top-k query processing on massively parallel hardware. In: Proceedings of the 2018 International Conference on Management of Data, pp 1557–1570
Gaihre Anil, Zheng Da, Weitze Scott, Li Lingda, Song Shuaiwen Leon, Ding Caiwen, Li Xiaoye S, Liu Hang (2021) Dr. top-k: delegate-centric top-k on gpus. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp 1–14
Abdelmoniem Ahmed M, Elzanaty Ahmed, Alouini Mohamed-Slim, Canini Marco (2021) An efficient statistical-based gradient compression technique for distributed training systems. Proc Mach Learn Syst 3:297–322
Sahu Atal, Dutta Aritra, Abdelmoniem Ahmed M, Banerjee Trambak, Canini Marco, Kalnis Panos (2021) Rethinking gradient sparsification as total error minimization. Adv Neural Inf Process Syst 34:8133–8146
Luebke David (2008) Cuda: Scalable parallel programming for high-performance scientific computing. In: 2008 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro. IEEE, pp 836–838
Sidco compressor source code (2021) https://github.com/sands-lab/SIDCo/blob/main/compression.py/
Seide Frank, Fu Hao, Droppo Jasha, Li Gang, Yu Dong (2014) 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In: Fifteenth Annual Conference of the International Speech Communication Association
Yamane Taro (1967) Statistics: an introductory analysis. Harper & Row
Israel Glenn D (1992) Determining sample size. Fact Sheet PEOD-6, University of Florida
He Kaiming, Zhang Xiangyu, Ren Shaoqing, Sun Jian (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
Krizhevsky Alex (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto
Hochreiter Sepp, Schmidhuber Jürgen (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Merity Stephen, Xiong Caiming, Bradbury James, Socher Richard (2016) Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843
He Xiangnan, Liao Lizi, Zhang Hanwang, Nie Liqiang, Hu Xia, Chua Tat-Seng (2017) Neural collaborative filtering. In: Proceedings of the 26th International Conference on World Wide Web, pp 173–182
Movielens 20m dataset (2015) https://grouplens.org/datasets/movielens/20m/
Paszke Adam, Gross Sam, Massa Francisco, Lerer Adam, Bradbury James, Chanan Gregory, Killeen Trevor, Lin Zeming, Gimelshein Natalia, Antiga Luca, Desmaison Alban, Kopf Andreas, Yang Edward, DeVito Zachary, Raison Martin, Tejani Alykhan, Chilamkurthy Sasank, Steiner Benoit, Fang Lu, Bai Junjie, Chintala Soumith (2019) Pytorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32:8024–8035
Hard-threshold sparsifier source code (2021) https://github.com/sands-lab/rethinking-sparsification/
Neural collaborative filtering source code (2018) https://github.com/yihong-chen/neural-collaborative-filtering/
Acknowledgements
This work was jointly supported by the BK21 FOUR program (NRF5199991014091), the Basic Science Research Program (2022R1F1A1062779) of National Research Foundation (NRF) of Korea, the Korea Institute of Science and Technology Information (KISTI) (TS-2022-RE-0019), and (KSC-2022-CRE-0406).
Author information
Authors and Affiliations
Contributions
DY and SO wrote the main manuscript text. DY and MJ carried out the experiment. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
Daegun Yoon, Minjoong Jeong, and Sangyoon Oh declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yoon, D., Jeong, M. & Oh, S. SAGE: toward on-the-fly gradient compression ratio scaling. J Supercomput 79, 11387–11409 (2023). https://doi.org/10.1007/s11227-023-05120-7