
Model averaging in distributed machine learning: a case study with Apache Spark

  • Regular Paper
  • Published in The VLDB Journal

Abstract

The increasing popularity of Apache Spark has attracted many users to put their data into its ecosystem. On the other hand, it has been observed in the literature that Spark is slow when it comes to distributed machine learning (ML). One resort is to switch to specialized systems such as parameter servers, which are claimed to offer better performance. Nonetheless, users then have to undergo the painful procedure of moving data into and out of Spark. In this paper, we investigate the performance bottlenecks of MLlib (the official Spark package for ML) in detail, focusing on its implementation of stochastic gradient descent (SGD), the workhorse behind the training of many ML models. We show that the performance inferiority of Spark is caused by implementation issues rather than fundamental flaws of the bulk synchronous parallel (BSP) model that governs Spark’s execution: we can significantly improve Spark’s performance by leveraging the well-known “model averaging” (MA) technique in distributed ML. Indeed, model averaging is not limited to SGD, and we further showcase an application of MA to training latent Dirichlet allocation (LDA) models within Spark. Our implementation is not intrusive and requires light development effort. Experimental evaluation results reveal that the MA-based versions of SGD and LDA can be orders of magnitude faster than their counterparts without MA.
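
To make the model-averaging idea concrete, the following sketch shows one way MA-SGD can be expressed on Spark's RDD API: in each BSP round the driver broadcasts the current model, every partition refines its own copy with a local SGD pass, and the driver then averages the refined copies (weighted by partition size). This is only a minimal illustration under assumed details (logistic loss, a fixed learning rate, and the hypothetical helpers MaSgdSketch and localSgdEpoch); it is not the MLlib-based implementation evaluated in the paper.

```scala
// A minimal sketch of model averaging (MA) for SGD on Spark's RDD API,
// using logistic regression as the running example. The helper names,
// the loss, and the fixed learning rate are illustrative assumptions.
import org.apache.spark.rdd.RDD

object MaSgdSketch {
  type Point = (Double, Array[Double])            // (label in {0, 1}, features)

  // One local SGD pass over a single partition, starting from the broadcast model.
  private def localSgdEpoch(points: Iterator[Point],
                            init: Array[Double],
                            lr: Double): Iterator[(Array[Double], Long)] = {
    val w = init.clone()
    var n = 0L
    points.foreach { case (y, x) =>
      var dot = 0.0
      var i = 0
      while (i < w.length) { dot += w(i) * x(i); i += 1 }
      val grad = 1.0 / (1.0 + math.exp(-dot)) - y // logistic-loss gradient factor
      i = 0
      while (i < w.length) { w(i) -= lr * grad * x(i); i += 1 }
      n += 1
    }
    Iterator.single((w, n))                       // local model and #points seen
  }

  // Each BSP round: broadcast the model, refine it locally on every partition,
  // then average the local models on the driver (weighted by partition size).
  def train(data: RDD[Point], dim: Int, rounds: Int, lr: Double): Array[Double] = {
    val sc = data.sparkContext
    var model = Array.fill(dim)(0.0)
    for (_ <- 1 to rounds) {
      val bcast = sc.broadcast(model)
      val (weightedSum, total) = data
        .mapPartitions(it => localSgdEpoch(it, bcast.value, lr))
        .map { case (w, n) => (w.map(_ * n), n) }
        .reduce { case ((w1, n1), (w2, n2)) =>
          (w1.zip(w2).map { case (a, b) => a + b }, n1 + n2)
        }
      model = weightedSum.map(_ / total)
      bcast.destroy()
    }
    model
  }
}
```

Compared with aggregating gradients after every mini-batch, each worker here communicates a single model per round, which illustrates why MA can substantially reduce communication overhead relative to per-mini-batch gradient aggregation.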


Notes

  1. It remains an open question whether convergence can be ensured when using model averaging (perhaps with an implementation different from the current MA-SGD) to train deep models.

  2. We assign one task to each executor because increasing the number of tasks per executor increases the time per iteration due to the heavy communication overhead.

  3. https://en.wikipedia.org/wiki/Gantt_chart.

  4. https://0x0fff.com/spark-architecture-shuffle/.

  5. We ignore the intermediate aggregators in Fig. 2b.

  6. However, the proof is very lengthy and technical, and thus is omitted here.

  7. https://en.wikipedia.org/wiki/Digamma_function.

  8. The default value of \(c\) is 49 to guarantee a relative error of \(10^{-8}\), though \(c=9\) is enough for a relative error of \(10^{-5}\) (see the generic digamma sketch after these notes).

  9. SVM is representative of GLMs; in fact, linear models share a similar training process from a system perspective.

  10. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

  11. https://archive.ics.uci.edu/ml/machine-learning-datasets/ and http://commoncrawl.org.

  12. http://spark.apache.org/docs/latest/tuning.html.

  13. We chose the batch sizes as follows: 16k for NYTimes, 64k for PubMed, and 100k for CommonCrawl.

  14. The speedup per iteration is computed from the time per iteration, i.e., by dividing the elapsed time (the right plot) by the number of iterations (the left plot).

  15. Since L-BFGS in spark.ml performs normalization, we evaluate the models with the normalized data.
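
Notes 7 and 8 both concern evaluating the digamma function \(\psi\). For reference, the generic sketch below shows the standard recipe of shifting the argument upward with the recurrence relation and then applying the asymptotic series; how the cutoff maps onto the constant \(c\) in note 8 is defined in the body of the paper, so that mapping is only assumed here.

```latex
% Generic digamma evaluation (a sketch, not necessarily the paper's routine):
% repeatedly shift the argument with the recurrence until it exceeds a cutoff,
% then apply the asymptotic expansion; a larger cutoff (or more series terms)
% yields a smaller relative error, which is presumably what c controls.
\psi(x) = \psi(x+1) - \frac{1}{x},
\qquad
\psi(x) \approx \ln x - \frac{1}{2x} - \frac{1}{12x^{2}} + \frac{1}{120x^{4}} - \frac{1}{252x^{6}}
\quad \text{for large } x.
```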


Acknowledgements

This work is funded by the National Natural Science Foundation of China (NSFC) under Grants No. 61832003 and U1811461, and by the Chinese Scholarship Council.

Author information


Corresponding author

Correspondence to Yunyan Guo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The work in this paper was performed while the first two authors were visiting students at ETH Zurich.


About this article


Cite this article

Guo, Y., Zhang, Z., Jiang, J. et al. Model averaging in distributed machine learning: a case study with Apache Spark. The VLDB Journal 30, 693–712 (2021). https://doi.org/10.1007/s00778-021-00664-7

