Training and Serving Machine Learning Models at Scale

  • Conference paper
  • Service-Oriented Computing (ICSOC 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13740)

Abstract

In recent years, Web services have become more and more intelligent (e.g., in understanding user preferences) thanks to the integration of components that rely on Machine Learning (ML). Before users can interact with an ML-based service (ML-Service) during the inference phase, the underlying ML model must learn from existing data during the training phase, a process that requires long-lasting batch computations. Managing these two diverse phases is complex, and time and quality requirements can hardly be met with manual approaches.

This paper highlights some of the major issues in managing ML-Services in both training and inference modes and presents some initial solutions that meet set requirements with minimal user input. A preliminary evaluation demonstrates that our solutions make these systems more efficient and predictable with respect to their response time and accuracy.


Acknowledgments

This work has been partially supported by the SISMA (MIUR, PRIN 2017, Contract 201752ENYB) and EMELIOT (MUR, PRIN 2020, Contract 2020W3A5FY) national research projects.

Author information

Corresponding author

Correspondence to Giovanni Quattrocchi.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Baresi, L., Quattrocchi, G. (2022). Training and Serving Machine Learning Models at Scale. In: Troya, J., Medjahed, B., Piattini, M., Yao, L., Fernández, P., Ruiz-Cortés, A. (eds) Service-Oriented Computing. ICSOC 2022. Lecture Notes in Computer Science, vol 13740. Springer, Cham. https://doi.org/10.1007/978-3-031-20984-0_48

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-20984-0_48

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20983-3

  • Online ISBN: 978-3-031-20984-0

  • eBook Packages: Computer Science, Computer Science (R0)
