Abstract
Synchronous distributed data parallel (SDDP) training is widely employed in distributed deep learning systems to train DNN models on large datasets. The performance of SDDP training depends essentially on the communication overhead and the statistical efficiency. However, existing approaches optimize only one of the two, either the communication overhead or the statistical efficiency, to accelerate SDDP training. In this paper, we combine the advantages of both lines of work and design a new approach, namely SkipSMA, that benefits from both low communication overhead and high statistical efficiency. In particular, we exploit a skipping strategy with an adaptive interval to decrease the communication frequency, which guarantees low communication overhead. Moreover, we employ a correction technique to mitigate model divergence while keeping small batch sizes, which ensures high statistical efficiency. To demonstrate the performance of SkipSMA, we integrate it into TensorFlow. Our experiments show that SkipSMA outperforms state-of-the-art solutions for SDDP training, achieving, e.g., a 6.88x speedup over SSGD.
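To make the general idea of the abstract concrete, the following is a minimal, self-contained sketch (plain NumPy, single process, simulated workers) of communication-skipping data-parallel SGD: workers run local SGD steps, skip synchronization for an adaptive number of steps, then average their models and apply a correction toward the average. The interval-adaptation rule, the correction coefficient, and all names below are illustrative assumptions, not the exact SkipSMA formulas from the paper.

```python
# Illustrative simulation of communication-skipping data-parallel SGD.
# NOTE: the adaptive-interval rule and the correction term are hypothetical;
# they are not the SkipSMA algorithm as published.
import numpy as np

np.random.seed(0)
NUM_WORKERS, DIM, STEPS, LR = 4, 10, 200, 0.1
A = np.random.randn(DIM, DIM)
A = A @ A.T / DIM  # toy quadratic loss 0.5 * w^T A w

def grad(w, noise=0.05):
    # Stochastic gradient of the toy loss; noise mimics small-batch sampling.
    return A @ w + noise * np.random.randn(DIM)

workers = [np.random.randn(DIM) for _ in range(NUM_WORKERS)]
interval, since_sync = 1, 0  # current skipping interval and steps since last sync

for step in range(STEPS):
    # Local SGD step on every worker, no communication.
    for i in range(NUM_WORKERS):
        workers[i] -= LR * grad(workers[i])
    since_sync += 1

    if since_sync >= interval:  # time to synchronize
        avg = np.mean(workers, axis=0)  # model averaging (stands in for all-reduce)
        divergence = np.mean([np.linalg.norm(w - avg) for w in workers])
        for i in range(NUM_WORKERS):
            # Hypothetical correction: pull each replica toward the average
            # instead of overwriting it outright.
            workers[i] = avg + 0.1 * (workers[i] - avg)
        # Hypothetical adaptive rule: sync less often when replicas agree,
        # more often when they have drifted apart.
        interval = min(interval + 1, 8) if divergence < 0.5 else max(interval - 1, 1)
        since_sync = 0

w_avg = np.mean(workers, axis=0)
print("final loss:", 0.5 * w_avg @ A @ w_avg)
```

The design choice being illustrated is the trade-off the abstract describes: a larger skipping interval lowers communication overhead, while the correction at each synchronization point limits the divergence that skipping introduces, so small per-worker batch sizes (and hence good statistical efficiency) can be retained.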
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 62272168), and Natural Science Foundation of Shanghai (No. 23ZR1419900).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Sun, Y., Bi, N., Xu, C., Niu, Y., Zhou, H. (2024). Accelerating Synchronous Distributed Data Parallel Training with Small Batch Sizes. In: Onizuka, M., et al. Database Systems for Advanced Applications. DASFAA 2024. Lecture Notes in Computer Science, vol 14854. Springer, Singapore. https://doi.org/10.1007/978-981-97-5569-1_33
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5568-4
Online ISBN: 978-981-97-5569-1