
SAGE: toward on-the-fly gradient compression ratio scaling

The Journal of Supercomputing

Abstract

Gradient sparsification is widely adopted in distributed training; however, it suffers from a trade-off between computation and communication. The prevalent Top-k sparsifier achieves the desired gradient compression ratio exactly but imposes substantial computational overhead. Conversely, the hard-threshold sparsifier eliminates this computational cost but fails to achieve the targeted compression ratio. Motivated by this trade-off, we designed a novel threshold-based sparsifier called SAGE, which achieves a compression ratio close to that of the Top-k sparsifier with negligible computational overhead. SAGE scales the compression ratio by deriving an adjustable threshold from heuristics computed at each iteration. Experimental results show that SAGE achieves a compression ratio closer to the desired ratio than the hard-threshold sparsifier without degrading the accuracy of model training. In terms of computation time for gradient selection, SAGE achieves a speedup of up to \(23.62\times\) over the Top-k sparsifier.
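To make the trade-off concrete, the sketch below contrasts exact Top-k selection with a generic threshold-based sparsifier that rescales its threshold toward a target density at every iteration. This is a minimal illustration only, not SAGE's actual heuristic: the function names, the multiplicative update rule, and the damping parameter eta are assumptions introduced for this example.

    import torch

    def topk_sparsify(grad: torch.Tensor, density: float):
        # Exact Top-k: hits the target density precisely, but the selection
        # itself is expensive on large gradient tensors.
        k = max(1, int(grad.numel() * density))
        flat = grad.flatten()
        _, indices = torch.topk(flat.abs(), k)
        return indices, flat[indices]

    def adaptive_threshold_sparsify(grad: torch.Tensor, threshold: float,
                                    target_density: float, eta: float = 0.1):
        # Threshold selection is cheap (one elementwise comparison); the
        # threshold is then rescaled so the achieved density drifts toward
        # the target on the next iteration.
        flat = grad.flatten()
        indices = (flat.abs() >= threshold).nonzero(as_tuple=False).flatten()
        achieved_density = indices.numel() / flat.numel()
        if achieved_density > 0:
            # Too dense -> raise the threshold; too sparse -> lower it.
            threshold *= (achieved_density / target_density) ** eta
        else:
            threshold *= 1.0 - eta  # nothing selected: relax the threshold
        return indices, flat[indices], threshold

    # Usage: carry the threshold across iterations so it tracks the target density.
    grad = torch.randn(1_000_000)
    threshold = grad.abs().mean().item()  # hypothetical warm start
    for _ in range(5):
        indices, values, threshold = adaptive_threshold_sparsify(grad, threshold, 0.001)

The point of the comparison is that the threshold variant replaces the costly Top-k selection with a constant-time threshold update, at the price of only approximately matching the target density.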


Data availability statement

The data presented in this study are publicly available at https://github.com/kljp/sage/.

Notes

  1. Several articles use the term compression ratio, which is the inverse of the density. For example, the compression ratio is 1000\(\times\) if the density is 0.001. For broader accessibility, we use the term compression ratio in our paper title.

  2. SAGE is short for sparsity-adjustable gradient exchange.

  3. https://github.com/kljp/sage/

  4. We denote the L2 norm as \({\Vert }{\cdot }{\Vert }\) and use its squared value to evaluate the error, as in a prior study [23]; see the sketch after these notes.
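As a minimal sketch of the error metric in Note 4: the compression error is measured as the squared L2 norm of the difference between the dense gradient and its sparsified counterpart, following the convention of [23]. The symbols g and g_hat and the Top-k construction below are assumptions introduced for illustration.

    import torch

    # Dense gradient g and a sparsified version g_hat that keeps the k largest
    # magnitudes; the compression error is the squared L2 norm of the difference.
    g = torch.randn(10_000)
    k = 10
    _, idx = torch.topk(g.abs(), k)
    g_hat = torch.zeros_like(g)
    g_hat[idx] = g[idx]
    compression_error = torch.norm(g - g_hat) ** 2  # ||g - g_hat||^2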

References

  1. Floridi Luciano, Chiriatti Massimo (2020) GPT-3: its nature, scope, limits, and consequences. Minds Mach 30(4):681–694


  2. Shoeybi Mohammad, Patwary Mostofa, Puri Raul, LeGresley Patrick, Casper Jared, Catanzaro Bryan (2019) Megatron-LM: training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053

  3. Rosset Corby (2020) Turing-NLG: a 17-billion-parameter language model by Microsoft. Microsoft Blog

  4. Lin Yujun, Han Song, Mao Huizi, Wang Yu, Dally William J (2017) Deep gradient compression: reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887

  5. Shi Shaohuai, Wang Qiang, Zhao Kaiyong, Tang Zhenheng, Wang Yuxin, Huang Xiang, Chu Xiaowen (2019) A distributed synchronous SGD algorithm with global top-k sparsification for low bandwidth networks. In: 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), IEEE, pp 2238–2247

  6. Chen Chia-Yu, Choi Jungwook, Brand Daniel, Agrawal Ankur, Zhang Wei, Gopalakrishnan Kailash (2018) AdaComp: adaptive residual gradient compression for data-parallel distributed training. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 32

  7. Shi Shaohuai, Chu Xiaowen, Cheung Ka Chun, See Simon (2019) Understanding top-k sparsification in distributed deep learning. arXiv preprint arXiv:1911.08772

  8. Shi Shaohuai, Tang Zhenheng, Wang Qiang, Zhao Kaiyong, Chu Xiaowen (2019) Layer-wise adaptive gradient sparsification for distributed deep learning with convergence guarantees. arXiv preprint arXiv:1911.08727

  9. Shi Shaohuai, Wang Qiang, Chu Xiaowen, Li Bo, Qin Yang, Liu Ruihao, Zhao Xinxiao (2020) Communication-efficient distributed deep learning with merged gradient sparsification on GPUs. In: IEEE INFOCOM 2020-IEEE Conference on Computer Communications. IEEE, pp 406–415

  10. Chen Chia-Yu, Ni Jiamin, Lu Songtao, Cui Xiaodong, Chen Pin-Yu, Sun Xiao, Wang Naigang, Venkataramani Swagath, Srinivasan Vijayalakshmi, Zhang Wei, Gopalakrishnan Kailash (2020) ScaleCom: scalable sparsified gradient compression for communication-efficient distributed training. Adv Neural Inf Process Syst 33:13551–13563


  11. Alistarh Dan, Grubic Demjan, Li Jerry, Tomioka Ryota, Vojnovic Milan (2017) QSGD: communication-efficient SGD via gradient quantization and encoding. Adv Neural Inf Process Syst 30

  12. Du Yuqing, Yang Sheng, Huang Kaibin (2020) High-dimensional stochastic gradient quantization for communication-efficient edge learning. IEEE Trans Signal Process 68:2128–2142


  13. Mordido Gonçalo, Van Keirsbilck Matthijs, Keller Alexander (2020) Monte carlo gradient quantization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp 718–719

  14. Faghri Fartash, Tabrizian Iman, Markov Ilia, Alistarh Dan, Roy Daniel M, Ramezani-Kebrya Ali (2020) Adaptive gradient quantization for data-parallel SGD. Adv Neural Inf Process Syst 33:3174–3185


  15. Tang Hanlin, Li Yao, Liu Ji, Yan Ming (2021) ErrorCompensatedX: error compensation for variance reduced algorithms. Adv Neural Inf Process Syst 34:18102–18113


  16. Abrahamyan Lusine, Chen Yiming, Bekoulis Giannis, Deligiannis Nikos (2021) Learned gradient compression for distributed deep learning. IEEE Trans Neural Netw Learn Syst 33(12):7330–7344


  17. Aji Alham Fikri, Heafield Kenneth (2017) Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021

  18. Alistarh Dan, Hoefler Torsten, Johansson Mikael, Konstantinov Nikola, Khirirat Sarit, Renggli Cédric (2018) The convergence of sparsified gradient methods. Adv Neural Inf Process Syst 31

  19. Shi Shaohuai, Zhao Kaiyong, Wang Qiang, Tang Zhenheng, Chu Xiaowen (2019) A convergence analysis of distributed SGD with communication-efficient gradient sparsification. In: IJCAI, pp 3411–3417

  20. Shanbhag Anil, Pirk Holger, Madden Samuel (2018) Efficient top-k query processing on massively parallel hardware. In: Proceedings of the 2018 International Conference on Management of Data, pp 1557–1570

  21. Gaihre Anil, Zheng Da, Weitze Scott, Li Lingda, Song Shuaiwen Leon, Ding Caiwen, Li Xiaoye S, Liu Hang (2021) Dr. top-k: delegate-centric top-k on GPUs. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp 1–14

  22. Abdelmoniem Ahmed M, Elzanaty Ahmed, Alouini Mohamed-Slim, Canini Marco (2021) An efficient statistical-based gradient compression technique for distributed training systems. Proc Mach Learn Syst 3:297–322


  23. Sahu Atal, Dutta Aritra, Abdelmoniem Ahmed M, Banerjee Trambak, Canini Marco, Kalnis Panos (2021) Rethinking gradient sparsification as total error minimization. Adv Neural Inf Process Syst 34:8133–8146


  24. Luebke David (2008) CUDA: scalable parallel programming for high-performance scientific computing. In: 2008 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro. IEEE, pp 836–838

  25. SIDCo compressor source code (2021) https://github.com/sands-lab/SIDCo/blob/main/compression.py/

  26. Seide Frank, Fu Hao, Droppo Jasha, Li Gang, Yu Dong (2014) 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In: Fifteenth Annual Conference of the International Speech Communication Association

  27. Yamane Taro (1967) Statistics: an introductory analysis. Harper & Row

  28. Israel Glenn D (1992) Determining sample size. Fact Sheet PEOD-6, University of Florida

  29. He Kaiming, Zhang Xiangyu, Ren Shaoqing, Sun Jian (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778

  30. Krizhevsky Alex (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto

  31. Hochreiter Sepp, Schmidhuber Jürgen (1997) Long short-term memory. Neural Comput 9(8):1735–1780


  32. Merity Stephen, Xiong Caiming, Bradbury James, Socher Richard (2016) Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843

  33. He Xiangnan, Liao Lizi, Zhang Hanwang, Nie Liqiang, Hu Xia, Chua Tat-Seng (2017) Neural collaborative filtering. In: Proceedings of the 26th International Conference on World Wide Web, pp 173–182

  34. MovieLens 20M dataset (2015) https://grouplens.org/datasets/movielens/20m/

  35. Paszke Adam, Gross Sam, Massa Francisco, Lerer Adam, Bradbury James, Chanan Gregory, Killeen Trevor, Lin Zeming, Gimelshein Natalia, Antiga Luca, Desmaison Alban, Kopf Andreas, Yang Edward, DeVito Zachary, Raison Martin, Tejani Alykhan, Chilamkurthy Sasank, Steiner Benoit, Fang Lu, Bai Junjie, Chintala Soumith (2019) PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32:8024–8035


  36. Hard-threshold sparsifier source code (2021) https://github.com/sands-lab/rethinking-sparsification/

  37. Neural collaborative filtering source code (2018) https://github.com/yihong-chen/neural-collaborative-filtering/


Acknowledgements

This work was jointly supported by the BK21 FOUR program (NRF5199991014091), the Basic Science Research Program (2022R1F1A1062779) of the National Research Foundation (NRF) of Korea, and the Korea Institute of Science and Technology Information (KISTI) (TS-2022-RE-0019 and KSC-2022-CRE-0406).

Author information

Authors and Affiliations

Authors

Contributions

DY and SO wrote the main manuscript text. DY and MJ carried out the experiment. All authors reviewed the manuscript.

Corresponding author

Correspondence to Sangyoon Oh.

Ethics declarations

Conflict of interest

Daegun Yoon, Minjoong Jeong, and Sangyoon Oh declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yoon, D., Jeong, M. & Oh, S. SAGE: toward on-the-fly gradient compression ratio scaling. J Supercomput 79, 11387–11409 (2023). https://doi.org/10.1007/s11227-023-05120-7

