Accelerating Synchronous Distributed Data Parallel Training with Small Batch Sizes

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14854)

Abstract

Synchronous distributed data parallel (SDDP) training is widely employed in distributed deep learning systems to train DNN models on large datasets. The performance of SDDP training depends essentially on the communication overhead and the statistical efficiency. However, existing approaches accelerate SDDP training by optimizing only one of the two, either the communication overhead or the statistical efficiency. In this paper, we combine the advantages of both lines of work and design a new approach, SkipSMA, that benefits from both low communication overhead and high statistical efficiency. In particular, we exploit a skipping strategy with an adaptive interval to decrease the communication frequency, which keeps the communication overhead low. Moreover, we employ a correction technique to mitigate model divergence while keeping batch sizes small, which ensures high statistical efficiency. To demonstrate the performance of SkipSMA, we integrate it into TensorFlow. Our experiments show that SkipSMA outperforms state-of-the-art solutions for SDDP training, achieving, e.g., a 6.88x speedup over SSGD.
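
The abstract only sketches the two ingredients of SkipSMA: synchronization skipping with an adaptive interval, and a correction step that limits divergence between local models trained with small batches. The concrete skipping rule and correction formula are not given here, so the following Python/NumPy sketch merely illustrates that general idea under assumed heuristics; adaptive_interval, correction_rate, and the toy quadratic loss are illustrative assumptions, not the authors' implementation.

# Minimal sketch of the SkipSMA idea from the abstract (NOT the paper's algorithm):
# workers run local small-batch SGD, skip most synchronization rounds using an
# adaptive interval, and apply a correction that pulls each local model toward
# the averaged model to limit divergence. All heuristics below are assumptions.
import numpy as np

rng = np.random.default_rng(0)
num_workers, dim, lr, correction_rate = 4, 10, 0.05, 0.5
models = [rng.normal(size=dim) for _ in range(num_workers)]

def local_gradient(w):
    # Toy quadratic loss 0.5 * ||w||^2, with noise standing in for small-batch gradients.
    return w + 0.01 * rng.normal(size=w.shape)

def adaptive_interval(divergence, base=2, max_skip=16):
    # Assumed heuristic: skip more synchronization rounds when local models stay close.
    return int(np.clip(base / (divergence + 1e-8), base, max_skip))

next_sync = 0
for step in range(100):
    # Local small-batch SGD step on every worker (no communication).
    models = [w - lr * local_gradient(w) for w in models]
    if step == next_sync:
        # Synchronization round: average the models, measure their divergence,
        # apply the correction, and choose how many rounds to skip next.
        avg = np.mean(models, axis=0)
        divergence = float(np.mean([np.linalg.norm(w - avg) for w in models]))
        models = [w + correction_rate * (avg - w) for w in models]
        next_sync = step + adaptive_interval(divergence)

print("final average model norm:", np.linalg.norm(np.mean(models, axis=0)))

In a real SDDP system, the averaging step would run across workers via collective communication (e.g., an all-reduce) inside TensorFlow rather than over in-process arrays as in this single-process illustration.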


Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 62272168), and Natural Science Foundation of Shanghai (No. 23ZR1419900).

Author information

Corresponding author

Correspondence to Chen Xu.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Sun, Y., Bi, N., Xu, C., Niu, Y., Zhou, H. (2024). Accelerating Synchronous Distributed Data Parallel Training with Small Batch Sizes. In: Onizuka, M., et al. (eds.) Database Systems for Advanced Applications. DASFAA 2024. Lecture Notes in Computer Science, vol 14854. Springer, Singapore. https://doi.org/10.1007/978-981-97-5569-1_33

  • DOI: https://doi.org/10.1007/978-981-97-5569-1_33

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-5568-4

  • Online ISBN: 978-981-97-5569-1

  • eBook Packages: Computer Science, Computer Science (R0)
