Abstract
Distributed ML training is widely used to improve training performance. However, current distributed training frameworks impose undesirable burdens on application-oriented users because of their server-centric design, and they make it difficult for users to customize training (e.g., with adaptive policies) to guarantee performance in dynamic environments. It is therefore worthwhile to make training frameworks lightweight and programmable, and we argue that the serverless paradigm can effectively meet these demands. In this paper, we propose TrainFlow, which adopts the serverless paradigm to simplify data-parallel training and extend its programmability. First, the basic framework is built on a novel serverless process model that provides a high-level view of training and various forms of state sharing; training is then divided into two processes, each with a specific workflow. Second, TrainFlow provides an event-driven hook mechanism that allows users to customize the training workflow. We implement TrainFlow on OpenFaaS and evaluate its availability and programmability. For availability, TrainFlow supports various training patterns and shows advantages in performance (e.g., a 1.6× higher speedup ratio than the baseline) and resource consumption (e.g., up to 41.0% less memory than the baseline). For programmability, TrainFlow works with adaptive policies as expected (e.g., up to 1.48× higher throughput in one case).
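The abstract does not reproduce TrainFlow's hook API, so the following is only a minimal Python sketch of what an event-driven hook mechanism for customizing a data-parallel training loop might look like. All names here (HookRegistry, on/emit, the epoch_end event, and the adaptive batch-size policy) are hypothetical illustrations, not TrainFlow's actual interface.

# Hedged sketch: an event-driven hook registry that lets user callbacks
# adjust the training workflow. Names and events are hypothetical, not
# taken from TrainFlow.
from collections import defaultdict
from typing import Callable, Dict, List

class HookRegistry:
    """Maps event names to user-registered callbacks."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def on(self, event: str, fn: Callable[[dict], None]) -> None:
        """Register a callback to run when `event` is emitted."""
        self._hooks[event].append(fn)

    def emit(self, event: str, ctx: dict) -> None:
        """Fire all callbacks for `event`; they may mutate ctx."""
        for fn in self._hooks[event]:
            fn(ctx)

def adaptive_batch_policy(ctx: dict) -> None:
    # Example adaptive policy: halve the batch size when many workers straggle.
    if ctx.get("straggler_ratio", 0.0) > 0.2:
        ctx["batch_size"] = max(16, ctx["batch_size"] // 2)

hooks = HookRegistry()
hooks.on("epoch_end", adaptive_batch_policy)

ctx = {"batch_size": 128, "straggler_ratio": 0.3}
for epoch in range(3):
    # ... run one data-parallel training epoch here ...
    hooks.emit("epoch_end", ctx)
    print(f"epoch {epoch}: batch_size={ctx['batch_size']}")

In this style, an adaptive policy is just another callback on the training workflow, which matches the abstract's claim that users customize training through hooks rather than by modifying the framework itself.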
Copyright information
© 2022 IFIP International Federation for Information Processing
About this paper
Cite this paper
Tan, W. et al. (2022). TrainFlow: A Lightweight, Programmable ML Training Framework via Serverless Paradigm. In: Liu, S., Wei, X. (eds) Network and Parallel Computing. NPC 2022. Lecture Notes in Computer Science, vol 13615. Springer, Cham. https://doi.org/10.1007/978-3-031-21395-3_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21394-6
Online ISBN: 978-3-031-21395-3
eBook Packages: Computer Science, Computer Science (R0)