Abstract
Predicting the runtime of distributed iterative jobs can help reduce the deployment cost of clusters and optimize their resource allocation and scheduling strategies, but the runtime depends on various factors that are difficult to obtain before execution. In this paper, we propose a generalized online prediction method for the runtime of distributed iterative jobs, centered on a series of online machine learning models. The method consists of three phases: 1) estimating the number of iterations of the current iterative job; 2) predicting the runtime metrics of each iteration with an online polynomial regression model; and 3) analyzing the sequence of runtime metrics with an LSTM trained by online learning to predict the runtime of each iteration. We conducted experiments on typical Flink iterative jobs, and the results show that our method improves accuracy by 4.79% compared to state-of-the-art methods, and by more than 15% for delta iterative jobs.
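As a rough illustration of the second phase, the sketch below fits a polynomial to a stream of per-iteration metric values and updates it one observation at a time. It is not the authors' implementation: the class name `OnlinePolynomialRegressor`, the use of recursive least squares as the online update, the choice of the iteration index as the only input, and the synthetic metric are all assumptions made for this example.

```python
class OnlinePolynomialRegressor:
    """Polynomial regression on a scalar input, updated one sample at a time.

    Assumption: recursive least squares (RLS) stands in for whatever
    online update the paper actually uses.
    """

    def __init__(self, degree=2, delta=1e6):
        self.degree = degree
        n = degree + 1
        self.theta = [0.0] * n  # polynomial coefficients, lowest power first
        # Large initial covariance: almost no prior belief about theta.
        self.P = [[delta if i == j else 0.0 for j in range(n)] for i in range(n)]

    def _features(self, x):
        # Expand x into [1, x, x^2, ..., x^degree].
        return [x ** k for k in range(self.degree + 1)]

    def predict(self, x):
        phi = self._features(x)
        return sum(t * f for t, f in zip(self.theta, phi))

    def update(self, x, y):
        phi = self._features(x)
        n = len(phi)
        # P is kept symmetric, so P @ phi also gives phi^T @ P.
        p_phi = [sum(self.P[i][j] * phi[j] for j in range(n)) for i in range(n)]
        denom = 1.0 + sum(phi[i] * p_phi[i] for i in range(n))
        gain = [v / denom for v in p_phi]
        error = y - self.predict(x)  # prediction error before the update
        self.theta = [t + g * error for t, g in zip(self.theta, gain)]
        for i in range(n):
            for j in range(n):
                self.P[i][j] -= gain[i] * p_phi[j]


# Hypothetical usage: a synthetic metric that grows quadratically
# with the iteration index (not data from the paper).
model = OnlinePolynomialRegressor(degree=2)
for it in range(10):
    metric = 2.0 + 3.0 * it + 0.5 * it ** 2
    model.update(it, metric)

next_metric = model.predict(10)  # forecast for the next, unseen iteration
```

Each call to `update` costs a fixed amount of work that depends only on the polynomial degree, so the model can be refreshed after every iteration without storing earlier observations.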
Acknowledgments
This research was supported by the National Key R&D Program of China under Grant No. 2018YFB1004402 and the National Natural Science Foundation of China under Grant No. 61772124.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Yue, X., Shi, L., Zhao, Y., Ji, H., Wang, G. (2021). Online Runtime Prediction Method for Distributed Iterative Jobs. In: Xing, C., Fu, X., Zhang, Y., Zhang, G., Borjigin, C. (eds) Web Information Systems and Applications. WISA 2021. Lecture Notes in Computer Science, vol 12999. Springer, Cham. https://doi.org/10.1007/978-3-030-87571-8_14
DOI: https://doi.org/10.1007/978-3-030-87571-8_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87570-1
Online ISBN: 978-3-030-87571-8
eBook Packages: Computer Science, Computer Science (R0)