
A systematic evaluation of machine learning on serverless infrastructure

  • Regular Paper
  • Published in The VLDB Journal

Abstract

Recently, the serverless paradigm of computing has inspired research on its applicability to data-intensive tasks such as ETL, database query processing, and machine learning (ML) model training. Recent efforts have proposed multiple systems for training large-scale ML models in a distributed manner on top of serverless infrastructures (e.g., AWS Lambda). Yet, there is so far no consensus on the design space for such systems when compared with systems built on top of classical “serverful” infrastructures. Indeed, a variety of factors can impact the performance of distributed ML training, such as the optimization algorithm used and the synchronization protocol followed by parallel executors, and these must be carefully considered when designing serverless ML systems. To clarify contradictory observations from previous work, in this paper we present a systematic comparative study of serverless and serverful systems for distributed ML training. We present a design space that covers the design choices made by previous systems on aspects such as optimization algorithms and synchronization protocols. We then implement a platform, LambdaML, that enables a fair comparison between serverless and serverful systems by navigating this design space. We further extend LambdaML toward automated operation by designing a hyper-parameter tuning framework that leverages the capabilities of serverless infrastructure. We present empirical evaluation results using LambdaML on both single training jobs and multi-tenant workloads. Our results reveal that there is no “one size fits all” serverless solution given the current state of the art: one must choose different designs for different ML workloads. We also develop an analytic model, based on our empirical observations, that captures the cost/performance tradeoffs one has to consider when deciding between serverless and serverful designs for distributed ML training.
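To make the cost/performance tradeoff described above concrete, the sketch below gives a minimal back-of-the-envelope comparison of serverless and serverful training costs in Python. It is not the analytic model developed in the paper; the prices, worker counts, memory sizes, and runtime are assumptions chosen only for illustration (serverless billed per GB-second of function execution, serverful billed per instance-hour).

    # Illustrative cost sketch (assumed prices and workload; not the paper's analytic model).
    LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667  # assumed serverless (AWS Lambda) on-demand price
    EC2_PRICE_PER_HOUR = 3.06                  # assumed price of one serverful instance

    def serverless_cost(num_workers: int, memory_gb: float, runtime_s: float) -> float:
        """Cost of num_workers serverless functions, each running for runtime_s seconds."""
        return num_workers * memory_gb * runtime_s * LAMBDA_PRICE_PER_GB_SECOND

    def serverful_cost(num_instances: int, runtime_s: float) -> float:
        """Cost of keeping num_instances VMs provisioned for the whole job."""
        return num_instances * (runtime_s / 3600.0) * EC2_PRICE_PER_HOUR

    if __name__ == "__main__":
        # Hypothetical job: 10 serverless workers with 3 GB of memory each vs. 2 VMs,
        # both finishing the same training run in 20 minutes.
        runtime = 20 * 60
        print(f"serverless: ${serverless_cost(10, 3.0, runtime):.2f}")
        print(f"serverful:  ${serverful_cost(2, runtime):.2f}")

Under such a simplified model, the cheaper option depends on job duration, degree of parallelism, and memory footprint, which is consistent with the paper's conclusion that no single design wins for all ML workloads.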


Notes

  1. https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html

  2. https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#ConsistencyModel.

  3. Although we made significant efforts to run larger image datasets (e.g., ImageNet), the performance is extremely slow, since Lambda does not provide GPUs and limits the maximum memory.

  4. We do not include regression models such as linear regression, but we believe the trade-off space would be the same, since the model complexity is similar.

  5. http://star.mit.edu/cluster/

  6. http://projects.dfki.uni-kl.de/yfcc100m/

  7. Note that Hogwild! [57] trains RCV1 on a single machine within 9.5 s (excluding startup and data loading time), whereas the model training time of LambdaML is about 27 s. Hogwild! uses a lock-free asynchronous strategy and favors sparse datasets. Although its training algorithm differs from our setting, we believe it is important to report these numbers as a reference.

  8. Distributed PyTorch with ADMM achieves the best results when training LR and SVM, distributed PyTorch when training KM, and distributed PyTorch with SGD when training MN.

  9. Due to space limitations, we show results of four representative tasks. The observed patterns are the same on the other workloads.

References

  1. Abadi, D.J., Madden, S.R., Hachem, N.: Column-stores vs. row-stores: how different are they really? In: SIGMOD, pp. 967–980 (2008)

  2. Abadi, M., Barham, P., Chen, J., et al.: Tensorflow: a system for large-scale machine learning. In: OSDI, pp. 265–283 (2016)

  3. Akkus, I.E., Chen, R., Rimac, I., et al.: Sand: towards high-performance serverless computing. In: USENIX ATC, pp. 923–935 (2018)

  4. Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 5(1), 1–9 (2014)

  5. Baldini, I., Castro, P., Chang, K., et al.: Serverless computing: current trends and open problems. In: Research Advances in Cloud Computing, pp. 1–20 (2017)

  6. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. JMLR 13(2), 281–305 (2012)

  7. Bergstra, J., Yamins, D., Cox, D.D., et al.: Hyperopt: a python library for optimizing the hyperparameters of machine learning algorithms. In: SciPy, vol. 13, p. 20 (2013)

  8. Bhattacharjee, A., Barve, Y., Khare, S., Bao, S., Gokhale, A., Damiano, T.: Stratum: a serverless framework for the lifecycle management of machine learning-based data analytics tasks. In: OpML, pp. 59–61 (2019)

  9. Boehm, M., Tatikonda, S., Reinwald, B., et al.: Hybrid parallelization strategies for large-scale machine learning in SystemML. VLDB 7(7), 553–564 (2014)

  10. Boyd, S., Parikh, N., Chu, E., et al.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)

  11. Cao, W., Zhang, Y., Yang, X., Li, F., Wang, S., Hu, Q., Cheng, X., Chen, Z., Liu, Z., Fang, J., et al.: PolarDB serverless: a cloud native database for disaggregated data centers. In: SIGMOD, pp. 2477–2489 (2021)

  12. Carreira, J., Fonseca, P., Tumanov, A., Zhang, A., Katz, R.: Cirrus: a serverless framework for end-to-end ML workflows. In: SoCC, pp. 13–24 (2019)

  13. Castro, P., Ishakian, V., Muthusamy, V., Slominski, A.: The rise of serverless computing. Commun. ACM 62(12), 44–54 (2019)

  14. Chaturapruek, S., Duchi, J.C., Ré, C.: Asynchronous stochastic convex optimization: the noise is in the noise and SGD don’t care. In: NeurIPS, pp. 1531–1539 (2015)

  15. Chen, T., Guestrin, C.: Xgboost: a scalable tree boosting system. In: SIGKDD, pp. 785–794 (2016)

  16. Dean, J., Corrado, G., Monga, R., et al.: Large scale distributed deep networks. In: NeurIPS, pp. 1223–1231 (2012)

  17. Falkner, S., Klein, A., Hutter, F.: BOHB: robust and efficient hyperparameter optimization at scale. In: ICML, pp. 1437–1446 (2018)

  18. Fard, A., Le, A., Larionov, G., Dhillon, W., Bear, C.: Vertica-ML: distributed machine learning in vertica database. In: SIGMOD, pp. 755–768 (2020)

  19. Feng, L., Kudva, P., Da Silva, D., Hu, J.: Exploring serverless computing for neural network training. In: CLOUD, pp. 334–341 (2018)

  20. Fingler, H., Akshintala, A., Rossbach, C.J.: USETL: unikernels for serverless extract transform and load why should you settle for less? In: APSys, pp. 23–30 (2019)

  21. Gropp, W., Lusk, E., Skjellum, A.: Using MPI: portable parallel programming with the message-passing interface, vol. 1 (1999)

  22. Gupta, V., Kadhe, S., Courtade, T., Mahoney, M.W., Ramchandran, K.: Oversketched Newton: fast convex optimization for serverless systems. arXiv:1903.08857 (2019)

  23. Hellerstein, J.M., Faleiro, J.M., Gonzalez, J., et al.: Serverless computing: one step forward, two steps back. In: CIDR (2019)

  24. Hendrickson, S., Sturdevant, S., Harter, T., Venkataramani, V., Arpaci-Dusseau, A.C., Arpaci-Dusseau, R.H.: Serverless computation with OpenLambda. In: HotCloud (2016)

  25. Ho, Q., Cipar, J., Cui, H., et al.: More effective distributed ML via a stale synchronous parallel parameter server. In: NeurIPS, pp. 1223–1231 (2013)

  26. Hsieh, K., Harlap, A., Vijaykumar, N., Konomis, D., Ganger, G.R., Gibbons, P.B., Mutlu, O.: Gaia: geo-distributed machine learning approaching LAN speeds. In: NSDI, pp. 629–647 (2017)

  27. Huang, Y., Jin, T., Wu, Y., et al.: FlexPS: flexible parallelism control in parameter server architecture. VLDB 11(5), 566–579 (2018)

  28. Ishakian, V., Muthusamy, V., Slominski, A.: Serving deep learning models in a serverless platform. In: IC2E, pp. 257–262 (2018)

  29. Jiang, J., Cui, B., Zhang, C., Fu, F.: DimBoost: boosting gradient boosting decision tree to higher dimensions. In: SIGMOD, pp. 1363–1376 (2018)

  30. Jiang, J., Cui, B., Zhang, C., Yu, L.: Heterogeneity-aware distributed parameter servers. In: SIGMOD, pp. 463–478 (2017)

  31. Jiang, J., Fu, F., Yang, T., Cui, B.: SketchML: accelerating distributed machine learning with data sketches. In: SIGMOD, pp. 1269–1284 (2018)

  32. Jiang, J., Yu, L., Jiang, J., Liu, Y., Cui, B.: Angel: a new large-scale machine learning system. Natl. Sci. Rev. 5(2), 216–236 (2018)

  33. Jonas, E., Schleier-Smith, J., Sreekanti, V., et al.: Cloud programming simplified: a Berkeley view on serverless computing. arXiv:1902.03383 (2019)

  34. Kaoudi, Z., Quiané-Ruiz, J.A., Thirumuruganathan, S., Chawla, S., Agrawal, D.: A cost-based optimizer for gradient descent optimization. In: SIGMOD, pp. 977–992 (2017)

  35. Kara, K., Eguro, K., Zhang, C., Alonso, G.: ColumnML: column-store machine learning with on-the-fly data transformation. VLDB 12(4), 348–361 (2018)

  36. Klein, A., Falkner, S., Mansur, N., Hutter, F.: RoBO: a flexible and robust Bayesian optimization framework in Python. In: NIPS 2017 Bayesian Optimization Workshop, pp. 4–9 (2017)

  37. Klimovic, A., Wang, Y., Kozyrakis, C., Stuedi, P., Pfefferle, J., Trivedi, A.: Understanding ephemeral storage for serverless analytics. In: USENIX ATC, pp. 789–794 (2018)

  38. Klimovic, A., Wang, Y., Stuedi, P., Trivedi, A., Pfefferle, J., Kozyrakis, C.: Pocket: elastic ephemeral storage for serverless analytics. In: OSDI, pp. 427–444 (2018)

  39. Kraska, T., Talwalkar, A., Duchi, J.C., Griffith, R., Franklin, M.J., Jordan, M.I.: MLbase: a distributed machine-learning system. In: CIDR, vol. 1, pp. 2–1 (2013)

  40. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: a new benchmark collection for text categorization research. JMLR 5(4), 361–397 (2004)

  41. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband: a novel bandit-based approach to hyperparameter optimization. JMLR 18(1), 6765–6816 (2017)

  42. Li, M., Andersen, D.G., Smola, A.J., Yu, K.: Communication efficient distributed machine learning with the parameter server. In: NeurIPS, pp. 19–27 (2014)

  43. Li, S., Zhao, Y., Varma, R., et al.: PyTorch distributed: experiences on accelerating data parallel training. VLDB 13(12), 3005–3018 (2020)

  44. Liaw, R., Bhardwaj, R., Dunlap, L., Zou, Y., Gonzalez, J.E., Stoica, I., Tumanov, A.: HyperSched: dynamic resource reallocation for model development on a deadline. In: SoCC, pp. 61–73 (2019)

  45. Liu, J., Zhang, C.: Distributed learning systems with first-order methods. Found. Trends Databases 9, 1–100 (2020)

  46. McSherry, F., Isard, M., Murray, D.G.: Scalability! but at what cost? In: HotOS (2015)

  47. Meng, X., Bradley, J., Yavuz, B., et al.: MLlib: machine learning in apache spark. JMLR 17(1), 1235–1241 (2016)

  48. Misra, U., Liaw, R., Dunlap, L., Bhardwaj, R., Kandasamy, K., Gonzalez, J.E., Stoica, I., Tumanov, A.: RubberBand: cloud-based hyperparameter tuning. In: EuroSys, pp. 327–342 (2021)

  49. Müller, I., Marroquín, R., Alonso, G.: Lambada: interactive data analytics on cold data using serverless cloud infrastructure. In: SIGMOD, pp. 115–130 (2020)

  50. Narayanan, D., Santhanam, K., Kazhamiaka, F., Phanishayee, A., Zaharia, M.: Heterogeneity-aware cluster scheduling policies for deep learning workloads. In: OSDI, pp. 481–498 (2020)

  51. Ooi, B.C., Tan, K.L., Wang, S., et al.: SINGA: a distributed deep learning platform. In: MM, pp. 685–688 (2015)

  52. Paszke, A., Gross, S., Massa, F., et al.: PyTorch: an imperative style, high-performance deep learning library. NeurIPS 32, 8026–8037 (2019)

  53. Perron, M., Castro Fernandez, R., DeWitt, D., Madden, S.: Starling: A scalable query engine on cloud functions. In: SIGMOD, pp. 131–141 (2020)

  54. Poppe, O., Guo, Q., Lang, W., Arora, P., Oslake, M., Xu, S., Kalhan, A.: Moneyball: proactive auto-scaling in Microsoft Azure SQL database serverless. In: VLDB (2022)

  55. Pu, Q., Venkataraman, S., Stoica, I.: Shuffling, fast and slow: scalable analytics on serverless infrastructure. In: NSDI, pp. 193–206 (2019)

  56. Rausch, T., Hummer, W., Muthusamy, V., Rashed, A., Dustdar, S.: Towards a serverless platform for edge AI. In: HotEdge (2019)

  57. Recht, B., Re, C., Wright, S., Niu, F.: Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. In: NeurIPS, pp. 693–701 (2011)

  58. Schleier-Smith, J., Sreekanti, V., Khandelwal, A., Carreira, J., Yadwadkar, N.J., Popa, R.A., Gonzalez, J.E., Stoica, I., Patterson, D.A.: What serverless computing is and should become: the next phase of cloud computing. Commun. ACM 64(5), 76–84 (2021)

  59. Shankar, V., Krauth, K., Pu, Q., et al.: Numpywren: serverless linear algebra. arXiv:1810.09679 (2018)

  60. Sparks, E.R., Venkataraman, S., Kaftan, T., Franklin, M.J., Recht, B.: KeystoneML: optimizing pipelines for large-scale advanced analytics. In: ICDE, pp. 535–546 (2017)

  61. Tandon, R., Lei, Q., Dimakis, A.G., Karampatziakis, N.: Gradient coding: avoiding stragglers in distributed learning. In: ICML, pp. 3368–3376 (2017)

  62. Tang, H., Gan, S., Zhang, C., Zhang, T., Liu, J.: Communication compression for decentralized training. In: NeurIPS, pp. 7663–7673 (2018)

  63. Tang, H., Lian, X., Yan, M., Zhang, C., Liu, J.: D²: decentralized training over decentralized data. In: ICML, pp. 4848–4856 (2018)

  64. Wang, H., Niu, D., Li, B.: Distributed machine learning with a serverless architecture. In: INFOCOM, pp. 1288–1296 (2019)

  65. Wang, J., Joshi, G.: Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. arXiv:1810.08313 (2018)

  66. Wang, L., Li, M., Zhang, Y., Ristenpart, T., Swift, M.: Peeking behind the curtains of serverless platforms. In: USENIX ATC, pp. 133–146 (2018)

  67. Wawrzoniak, M., Müller, I., Fraga Barcelos Paulus Bruno, R., Alonso, G.: Boxer: data analytics on network-enabled serverless platforms. In: CIDR (2021)

  68. Wu, Y., Dinh, T.T.A., Hu, G., Zhang, M., Chee, Y.M., Ooi, B.C.: Serverless data science: are we there yet? A case study of model serving (2022)

  69. Zhang, H., Li, J., Kara, K., Alistarh, D., Liu, J., Zhang, C.: ZipML: training linear models with end-to-end low precision, and a little bit of deep learning. In: ICML, pp. 4035–4043 (2017)

  70. Zhang, Z., Jiang, J., Wu, W., Zhang, C., Yu, L., Cui, B.: MLlib*: fast training of GLMs using spark MLlib. In: ICDE, pp. 1778–1789 (2019)

  71. Zheng, S., Meng, Q., Wang, T., et al.: Asynchronous stochastic gradient descent with delay compensation. In: ICML, pp. 4120–4129 (2017)

  72. Zinkevich, M., Weimer, M., Smola, A.J., Li, L.: Parallelized stochastic gradient descent. In: NeurIPS, pp. 2595–2603 (2010)

Acknowledgements

This work was sponsored by the National Key R&D Program of China (No. 2022ZD0116315), the Key R&D Program of Hubei Province (No. 2023BAB077), and the Fundamental Research Funds for the Central Universities (No. 2042023kf0219). This work was also supported by Ant Group. We gratefully acknowledge the help from Quanqing Xu (xuquanqing.xqq@oceanbase.com) and Chuanhui Yang (rizhao.ych@oceanbase.com) from OceanBase, Ant Group.

Author information

Corresponding author

Correspondence to Bo Du.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jiang, J., Gan, S., Du, B. et al. A systematic evaluation of machine learning on serverless infrastructure. The VLDB Journal 33, 425–449 (2024). https://doi.org/10.1007/s00778-023-00813-0
