
A systematic evaluation of machine learning on serverless infrastructure

  • Regular Paper
  • Published in The VLDB Journal

Abstract

Recently, the serverless paradigm of computing has inspired research on its applicability to data-intensive tasks such as ETL, database query processing, and machine learning (ML) model training. Recent efforts have proposed multiple systems for training large-scale ML models in a distributed manner on top of serverless infrastructures (e.g., AWS Lambda). Yet, there is so far no consensus on the design space for such systems when compared with systems built on top of classical “serverful” infrastructures. Indeed, a variety of factors can impact the performance of distributed ML training, such as the optimization algorithm used and the synchronization protocol followed by parallel executors, and these must be carefully considered when designing serverless ML systems. To clarify contradictory observations from previous work, in this paper we present a systematic comparative study of serverless and serverful systems for distributed ML training. We present a design space that covers the design choices made by previous systems on aspects such as optimization algorithms and synchronization protocols. We then implement a platform, LambdaML, that enables a fair comparison between serverless and serverful systems by navigating this design space. We further extend LambdaML toward automated operation by designing a hyper-parameter tuning framework that leverages the capabilities of serverless infrastructure. We present empirical evaluation results using LambdaML on both single training jobs and multi-tenant workloads. Our results reveal that there is no “one size fits all” serverless solution given the current state of the art: one must choose different designs for different ML workloads. We also develop an analytic model, based on our empirical observations, that captures the cost/performance tradeoffs one has to consider when deciding between serverless and serverful designs for distributed ML training.
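To make the cost/performance tradeoff described above concrete, the sketch below gives a minimal back-of-the-envelope comparison of serverless and serverful training costs in Python. It is not the analytic model developed in the paper; the prices, worker counts, memory sizes, and runtime are assumptions chosen only for illustration (serverless billed per GB-second of function execution, serverful billed per instance-hour).

    # Illustrative cost sketch (assumed prices and workload; not the paper's analytic model).
    LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667  # assumed serverless (AWS Lambda) on-demand price
    EC2_PRICE_PER_HOUR = 3.06                  # assumed price of one serverful instance

    def serverless_cost(num_workers: int, memory_gb: float, runtime_s: float) -> float:
        """Cost of num_workers serverless functions, each running for runtime_s seconds."""
        return num_workers * memory_gb * runtime_s * LAMBDA_PRICE_PER_GB_SECOND

    def serverful_cost(num_instances: int, runtime_s: float) -> float:
        """Cost of keeping num_instances VMs provisioned for the whole job."""
        return num_instances * (runtime_s / 3600.0) * EC2_PRICE_PER_HOUR

    if __name__ == "__main__":
        # Hypothetical job: 10 serverless workers with 3 GB of memory each vs. 2 VMs,
        # both finishing the same training run in 20 minutes.
        runtime = 20 * 60
        print(f"serverless: ${serverless_cost(10, 3.0, runtime):.2f}")
        print(f"serverful:  ${serverful_cost(2, runtime):.2f}")

Under such a simplified model, the cheaper option depends on job duration, degree of parallelism, and memory footprint, which is consistent with the paper's conclusion that no single design wins for all ML workloads.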


Notes

  1. https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html

  2. https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#ConsistencyModel.

  3. Although we made significant efforts to run larger image datasets (e.g., ImageNet), the performance is extremely slow, since Lambda does not provide GPUs and limits the maximum memory.

  4. We do not include regression models such as linear regression, but we believe the trade-off space would be the same, since the model complexity is similar.

  5. http://star.mit.edu/cluster/

  6. http://projects.dfki.uni-kl.de/yfcc100m/

  7. Note that Hogwild! [57] trains RCV1 on a single machine within 9.5 s (excluding startup and data loading time), whereas the model training time of LambdaML is about 27 s. Hogwild! uses a lock-free asynchronous strategy and favors sparse datasets. Although its training algorithm differs from our setting, we believe it is important to report these numbers as a reference.

  8. Distributed PyTorch with ADMM achieves the best results when training LR and SVM, distributed PyTorch when training KM, and distributed PyTorch with SGD when training MN.

  9. Due to space limitations, we show results of four representative tasks. The observed patterns are the same on the other workloads.

References

  1. Abadi, D.J., Madden, S.R., Hachem, N.: Column-stores vs. row-stores: how different are they really? In: SIGMOD, pp. 967–980 (2008)

  2. Abadi, M., Barham, P., Chen, J., et al.: Tensorflow: a system for large-scale machine learning. In: OSDI, pp. 265–283 (2016)

  3. Akkus, I.E., Chen, R., Rimac, I., et al.: Sand: towards high-performance serverless computing. In: USENIX ATC, pp. 923–935 (2018)

  4. Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 5(1), 1–9 (2014)

  5. Baldini, I., Castro, P., Chang, K., et al.: Serverless computing: current trends and open problems. In: Research Advances in Cloud Computing, pp. 1–20 (2017)

  6. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. JMLR 13(2), 281–305 (2012)

  7. Bergstra, J., Yamins, D., Cox, D.D., et al.: Hyperopt: a python library for optimizing the hyperparameters of machine learning algorithms. In: SciPy, vol. 13, p. 20 (2013)

  8. Bhattacharjee, A., Barve, Y., Khare, S., Bao, S., Gokhale, A., Damiano, T.: Stratum: a serverless framework for the lifecycle management of machine learning-based data analytics tasks. In: OpML, pp. 59–61 (2019)

  9. Boehm, M., Tatikonda, S., Reinwald, B., et al.: Hybrid parallelization strategies for large-scale machine learning in SystemML. VLDB 7(7), 553–564 (2014)

  10. Boyd, S., Parikh, N., Chu, E., et al.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)

  11. Cao, W., Zhang, Y., Yang, X., Li, F., Wang, S., Hu, Q., Cheng, X., Chen, Z., Liu, Z., Fang, J., et al.: PolarDB serverless: a cloud native database for disaggregated data centers. In: SIGMOD, pp. 2477–2489 (2021)

  12. Carreira, J., Fonseca, P., Tumanov, A., Zhang, A., Katz, R.: Cirrus: a serverless framework for end-to-end ML workflows. In: SoCC, pp. 13–24 (2019)

  13. Castro, P., Ishakian, V., Muthusamy, V., Slominski, A.: The rise of serverless computing. Commun. ACM 62(12), 44–54 (2019)

  14. Chaturapruek, S., Duchi, J.C., Ré, C.: Asynchronous stochastic convex optimization: the noise is in the noise and SGD don’t care. In: NeurIPS, pp. 1531–1539 (2015)

  15. Chen, T., Guestrin, C.: Xgboost: a scalable tree boosting system. In: SIGKDD, pp. 785–794 (2016)

  16. Dean, J., Corrado, G., Monga, R., et al.: Large scale distributed deep networks. In: NeurIPS, pp. 1223–1231 (2012)

  17. Falkner, S., Klein, A., Hutter, F.: BOHB: robust and efficient hyperparameter optimization at scale. In: ICML, pp. 1437–1446 (2018)

  18. Fard, A., Le, A., Larionov, G., Dhillon, W., Bear, C.: Vertica-ML: distributed machine learning in vertica database. In: SIGMOD, pp. 755–768 (2020)

  19. Feng, L., Kudva, P., Da Silva, D., Hu, J.: Exploring serverless computing for neural network training. In: CLOUD, pp. 334–341 (2018)

  20. Fingler, H., Akshintala, A., Rossbach, C.J.: USETL: unikernels for serverless extract transform and load why should you settle for less? In: APSys, pp. 23–30 (2019)

  21. Gropp, W., Lusk, E., Skjellum, A.: Using MPI: portable parallel programming with the message-passing interface, vol. 1 (1999)

  22. Gupta, V., Kadhe, S., Courtade, T., Mahoney, M.W., Ramchandran, K.: Oversketched Newton: fast convex optimization for serverless systems. arXiv:1903.08857 (2019)

  23. Hellerstein, J.M., Faleiro, J.M., Gonzalez, J., et al.: Serverless computing: one step forward, two steps back. In: CIDR (2019)

  24. Hendrickson, S., Sturdevant, S., Harter, T., Venkataramani, V., Arpaci-Dusseau, A.C., Arpaci-Dusseau, R.H.: Serverless computation with OpenLambda. In: HotCloud (2016)

  25. Ho, Q., Cipar, J., Cui, H., et al.: More effective distributed ML via a stale synchronous parallel parameter server. In: NeurIPS, pp. 1223–1231 (2013)

  26. Hsieh, K., Harlap, A., Vijaykumar, N., Konomis, D., Ganger, G.R., Gibbons, P.B., Mutlu, O.: Gaia: geo-distributed machine learning approaching LAN speeds. In: NSDI, pp. 629–647 (2017)

  27. Huang, Y., Jin, T., Wu, Y., et al.: FlexPS: flexible parallelism control in parameter server architecture. VLDB 11(5), 566–579 (2018)

  28. Ishakian, V., Muthusamy, V., Slominski, A.: Serving deep learning models in a serverless platform. In: IC2E, pp. 257–262 (2018)

  29. Jiang, J., Cui, B., Zhang, C., Fu, F.: DimBoost: boosting gradient boosting decision tree to higher dimensions. In: SIGMOD, pp. 1363–1376 (2018)

  30. Jiang, J., Cui, B., Zhang, C., Yu, L.: Heterogeneity-aware distributed parameter servers. In: SIGMOD, pp. 463–478 (2017)

  31. Jiang, J., Fu, F., Yang, T., Cui, B.: SketchML: accelerating distributed machine learning with data sketches. In: SIGMOD, pp. 1269–1284 (2018)

  32. Jiang, J., Yu, L., Jiang, J., Liu, Y., Cui, B.: Angel: a new large-scale machine learning system. Natl. Sci. Rev. 5(2), 216–236 (2018)

  33. Jonas, E., Schleier-Smith, J., Sreekanti, V., et al.: Cloud programming simplified: a Berkeley view on serverless computing. arXiv:1902.03383 (2019)

  34. Kaoudi, Z., Quiané-Ruiz, J.A., Thirumuruganathan, S., Chawla, S., Agrawal, D.: A cost-based optimizer for gradient descent optimization. In: SIGMOD, pp. 977–992 (2017)

  35. Kara, K., Eguro, K., Zhang, C., Alonso, G.: ColumnML: column-store machine learning with on-the-fly data transformation. VLDB 12(4), 348–361 (2018)

  36. Klein, A., Falkner, S., Mansur, N., Hutter, F.: RoBO: a flexible and robust Bayesian optimization framework in Python. In: NIPS 2017 Bayesian Optimization Workshop, pp. 4–9 (2017)

  37. Klimovic, A., Wang, Y., Kozyrakis, C., Stuedi, P., Pfefferle, J., Trivedi, A.: Understanding ephemeral storage for serverless analytics. In: USENIX ATC, pp. 789–794 (2018)

  38. Klimovic, A., Wang, Y., Stuedi, P., Trivedi, A., Pfefferle, J., Kozyrakis, C.: Pocket: elastic ephemeral storage for serverless analytics. In: OSDI, pp. 427–444 (2018)

  39. Kraska, T., Talwalkar, A., Duchi, J.C., Griffith, R., Franklin, M.J., Jordan, M.I.: MLbase: a distributed machine-learning system. In: CIDR, vol. 1, pp. 2–1 (2013)

  40. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: a new benchmark collection for text categorization research. JMLR 5(4), 361–397 (2004)

  41. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband: a novel bandit-based approach to hyperparameter optimization. JMLR 18(1), 6765–6816 (2017)

  42. Li, M., Andersen, D.G., Smola, A.J., Yu, K.: Communication efficient distributed machine learning with the parameter server. In: NeurIPS, pp. 19–27 (2014)

  43. Li, S., Zhao, Y., Varma, R., et al.: PyTorch distributed: experiences on accelerating data parallel training. VLDB 13(12), 3005–3018 (2020)

  44. Liaw, R., Bhardwaj, R., Dunlap, L., Zou, Y., Gonzalez, J.E., Stoica, I., Tumanov, A.: HyperSched: dynamic resource reallocation for model development on a deadline. In: SoCC, pp. 61–73 (2019)

  45. Liu, J., Zhang, C.: Distributed learning systems with first-order methods. Found. Trends Databases 9, 1–100 (2020)

  46. McSherry, F., Isard, M., Murray, D.G.: Scalability! but at what cost? In: HotOS (2015)

  47. Meng, X., Bradley, J., Yavuz, B., et al.: MLlib: machine learning in apache spark. JMLR 17(1), 1235–1241 (2016)

  48. Misra, U., Liaw, R., Dunlap, L., Bhardwaj, R., Kandasamy, K., Gonzalez, J.E., Stoica, I., Tumanov, A.: RubberBand: cloud-based hyperparameter tuning. In: EuroSys, pp. 327–342 (2021)

  49. Müller, I., Marroquín, R., Alonso, G.: Lambada: interactive data analytics on cold data using serverless cloud infrastructure. In: SIGMOD, pp. 115–130 (2020)

  50. Narayanan, D., Santhanam, K., Kazhamiaka, F., Phanishayee, A., Zaharia, M.: Heterogeneity-aware cluster scheduling policies for deep learning workloads. In: OSDI, pp. 481–498 (2020)

  51. Ooi, B.C., Tan, K.L., Wang, S., et al.: SINGA: a distributed deep learning platform. In: MM, pp. 685–688 (2015)

  52. Paszke, A., Gross, S., Massa, F., et al.: PyTorch: an imperative style, high-performance deep learning library. NeurIPS 32, 8026–8037 (2019)

  53. Perron, M., Castro Fernandez, R., DeWitt, D., Madden, S.: Starling: A scalable query engine on cloud functions. In: SIGMOD, pp. 131–141 (2020)

  54. Poppe, O., Guo, Q., Lang, W., Arora, P., Oslake, M., Xu, S., Kalhan, A.: Moneyball: proactive auto-scaling in Microsoft Azure SQL database serverless. In: VLDB (2022)

  55. Pu, Q., Venkataraman, S., Stoica, I.: Shuffling, fast and slow: scalable analytics on serverless infrastructure. In: NSDI, pp. 193–206 (2019)

  56. Rausch, T., Hummer, W., Muthusamy, V., Rashed, A., Dustdar, S.: Towards a serverless platform for edge AI. In: HotEdge (2019)

  57. Recht, B., Re, C., Wright, S., Niu, F.: Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. In: NeurIPS, pp. 693–701 (2011)

  58. Schleier-Smith, J., Sreekanti, V., Khandelwal, A., Carreira, J., Yadwadkar, N.J., Popa, R.A., Gonzalez, J.E., Stoica, I., Patterson, D.A.: What serverless computing is and should become: the next phase of cloud computing. Commun. ACM 64(5), 76–84 (2021)

  59. Shankar, V., Krauth, K., Pu, Q., et al.: Numpywren: serverless linear algebra. arXiv:1810.09679 (2018)

  60. Sparks, E.R., Venkataraman, S., Kaftan, T., Franklin, M.J., Recht, B.: KeystoneML: optimizing pipelines for large-scale advanced analytics. In: ICDE, pp. 535–546 (2017)

  61. Tandon, R., Lei, Q., Dimakis, A.G., Karampatziakis, N.: Gradient coding: avoiding stragglers in distributed learning. In: ICML, pp. 3368–3376 (2017)

  62. Tang, H., Gan, S., Zhang, C., Zhang, T., Liu, J.: Communication compression for decentralized training. In: NeurIPS, pp. 7663–7673 (2018)

  63. Tang, H., Lian, X., Yan, M., Zhang, C., Liu, J.: D²: decentralized training over decentralized data. In: ICML, pp. 4848–4856 (2018)

  64. Wang, H., Niu, D., Li, B.: Distributed machine learning with a serverless architecture. In: INFOCOM, pp. 1288–1296 (2019)

  65. Wang, J., Joshi, G.: Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. arXiv:1810.08313 (2018)

  66. Wang, L., Li, M., Zhang, Y., Ristenpart, T., Swift, M.: Peeking behind the curtains of serverless platforms. In: USENIX ATC, pp. 133–146 (2018)

  67. Wawrzoniak, M., Müller, I., Fraga Barcelos Paulus Bruno, R., Alonso, G.: Boxer: data analytics on network-enabled serverless platforms. In: CIDR (2021)

  68. Wu, Y., Dinh, T.T.A., Hu, G., Zhang, M., Chee, Y.M., Ooi, B.C.: Serverless data science: are we there yet? A case study of model serving (2022)

  69. Zhang, H., Li, J., Kara, K., Alistarh, D., Liu, J., Zhang, C.: ZipML: training linear models with end-to-end low precision, and a little bit of deep learning. In: ICML, pp. 4035–4043 (2017)

  70. Zhang, Z., Jiang, J., Wu, W., Zhang, C., Yu, L., Cui, B.: MLlib*: fast training of GLMs using spark MLlib. In: ICDE, pp. 1778–1789 (2019)

  71. Zheng, S., Meng, Q., Wang, T., et al.: Asynchronous stochastic gradient descent with delay compensation. In: ICML, pp. 4120–4129 (2017)

  72. Zinkevich, M., Weimer, M., Smola, A.J., Li, L.: Parallelized stochastic gradient descent. In: NeurIPS, pp. 2595–2603 (2010)

Acknowledgements

This work was sponsored by the National Key R&D Program of China (No. 2022ZD0116315), the Key R&D Program of Hubei Province (No. 2023BAB077), and the Fundamental Research Funds for the Central Universities (No. 2042023kf0219). This work was also supported by Ant Group. We gratefully acknowledge the help from Quanqing Xu (xuquanqing.xqq@oceanbase.com) and Chuanhui Yang (rizhao.ych@oceanbase.com) from OceanBase, Ant Group.

Author information

Corresponding author

Correspondence to Bo Du.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jiang, J., Gan, S., Du, B. et al. A systematic evaluation of machine learning on serverless infrastructure. The VLDB Journal 33, 425–449 (2024). https://doi.org/10.1007/s00778-023-00813-0
