Online Runtime Prediction Method for Distributed Iterative Jobs

Yue, Xiaofei; Shi, Lan; Zhao, Yuhai; Ji, Hangxu; Wang, Guoren

doi:10.1007/978-3-030-87571-8_14

Conference paper
First Online: 17 September 2021

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12999)

Abstract

Predicting the runtime of distributed iterative jobs can help reduce the deployment cost of clusters and optimize their resource allocation and scheduling strategies, but the runtime depends on various factors that are difficult to obtain before execution. In this paper, we propose a generalized online prediction method for the runtime of distributed iterative jobs, centered on a series of online machine learning models. The method consists of three phases: 1) estimating the number of iterations of the current iterative job; 2) predicting the runtime metrics of each iteration with an online polynomial regression model; and 3) analyzing the sequence of runtime metrics with an LSTM trained by online learning to predict the runtime of each iteration. We conducted experiments on typical Flink iterative jobs, and the results show that our method improves accuracy by 4.79% compared to state-of-the-art methods, while the improvement for delta iterative jobs is more than 15%.
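
To make the second phase more concrete, the following is a minimal sketch of an online polynomial regression model that is updated one observation at a time. It is not the authors' implementation: the abstract does not say how the model is fitted, so the recursive least squares update, the class name `OnlinePolynomialRegressor`, and the parameters `degree`, `forgetting`, and `init_cov` are all assumptions made for illustration. The input `x` stands for an iteration index and `y` for one observed runtime metric of that iteration.

```python
class OnlinePolynomialRegressor:
    """Hypothetical online polynomial regressor (recursive least squares).

    Assumption: a single scalar input x (e.g. the iteration index) and a
    single scalar target y (e.g. one runtime metric of that iteration).
    """

    def __init__(self, degree=2, forgetting=1.0, init_cov=1e6):
        self.degree = degree
        self.forgetting = forgetting  # 1.0 means no forgetting of old samples
        n = degree + 1
        self.weights = [0.0] * n
        # Large initial covariance: the first observations dominate the prior.
        self.cov = [[init_cov if i == j else 0.0 for j in range(n)]
                    for i in range(n)]

    def _features(self, x):
        # Polynomial basis [1, x, x^2, ..., x^degree].
        return [x ** k for k in range(self.degree + 1)]

    def predict(self, x):
        return sum(w * f for w, f in zip(self.weights, self._features(x)))

    def update(self, x, y):
        """Incorporate one (x, y) observation; return the pre-update error."""
        phi = self._features(x)
        n = len(phi)
        cov_phi = [sum(self.cov[i][j] * phi[j] for j in range(n))
                   for i in range(n)]
        denom = self.forgetting + sum(phi[i] * cov_phi[i] for i in range(n))
        gain = [v / denom for v in cov_phi]
        error = y - self.predict(x)
        self.weights = [w + g * error for w, g in zip(self.weights, gain)]
        # The covariance is symmetric, so phi^T * cov equals cov_phi transposed.
        self.cov = [[(self.cov[i][j] - gain[i] * cov_phi[j]) / self.forgetting
                     for j in range(n)] for i in range(n)]
        return error


def synthetic_metric(iteration):
    # Made-up quadratic trend used only to exercise the sketch.
    return 3.0 + 2.0 * iteration + 0.5 * iteration ** 2


model = OnlinePolynomialRegressor(degree=2)
for it in range(1, 21):
    model.update(float(it), synthetic_metric(it))  # one update per iteration

next_metric = model.predict(21.0)  # forecast for the not-yet-run iteration
```

In such a setup the model would be refreshed after every finished iteration, so each forecast uses all observations seen so far without refitting from scratch. The third phase (the LSTM over the metrics sequence) is not covered by this sketch.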

Acknowledgments

This research was supported by the National Key R&D Program of China under Grant No. 2018YFB1004402 and the National Natural Science Foundation of China under Grant No. 61772124.

Author information

Authors and Affiliations

School of Computer Science and Engineering, Northeastern University, Shenyang, 110819, China
Xiaofei Yue, Lan Shi, Yuhai Zhao & Hangxu Ji
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
Guoren Wang

Corresponding author

Correspondence to Yuhai Zhao.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Chunxiao Xing
Institute of Computer Science, University of Göttingen, Goettingen, Germany
Xiaoming Fu
Tsinghua University, Beijing, China
Yong Zhang
Chinese Academy of Sciences, Beijing, China
Guigang Zhang
Renmin University of China, Beijing, China
Chaolemen Borjigin

Cite this paper

Yue, X., Shi, L., Zhao, Y., Ji, H., Wang, G. (2021). Online Runtime Prediction Method for Distributed Iterative Jobs. In: Xing, C., Fu, X., Zhang, Y., Zhang, G., Borjigin, C. (eds) Web Information Systems and Applications. WISA 2021. Lecture Notes in Computer Science, vol 12999. Springer, Cham. https://doi.org/10.1007/978-3-030-87571-8_14

DOI: https://doi.org/10.1007/978-3-030-87571-8_14
Published: 17 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87570-1
Online ISBN: 978-3-030-87571-8
eBook Packages: Computer Science, Computer Science (R0)
