Online Runtime Prediction Method for Distributed Iterative Jobs

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12999)

Abstract

Predicting the runtime of distributed iterative jobs can help reduce cluster deployment costs and improve resource allocation and scheduling, but the runtime depends on factors that are difficult to obtain before execution. In this paper, we propose a generalized online method for predicting the runtime of distributed iterative jobs, built on a series of online machine learning models. The method consists of three phases: 1) estimating the number of iterations of the current job; 2) predicting the runtime metrics of each iteration with an online polynomial regression model; and 3) analyzing the sequence of runtime metrics with an LSTM trained by online learning to predict the runtime of each iteration. Experiments on typical Flink iterative jobs show that our method improves accuracy by 4.79% over state-of-the-art methods, and by more than 15% for delta iterative jobs.
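To make the second phase more concrete, the following is a minimal sketch of an online polynomial regression model. The abstract does not say which update rule the authors use, so recursive least squares is an assumption here, as are the class name `OnlinePolynomialRegressor`, the use of the iteration index as the single input, and the synthetic metric values. It illustrates the general technique and is not the paper's implementation.

```python
class OnlinePolynomialRegressor:
    """Polynomial in one variable, fitted incrementally by recursive
    least squares (RLS). Hypothetical illustration, not the paper's model."""

    def __init__(self, degree=2, init_scale=1e6):
        self.n = degree + 1
        self.w = [0.0] * self.n  # polynomial coefficients, lowest order first
        # P approximates the inverse of the feature covariance matrix;
        # a large initial scale corresponds to a weak prior on the weights.
        self.P = [[init_scale if i == j else 0.0 for j in range(self.n)]
                  for i in range(self.n)]

    def _features(self, x):
        # Polynomial features [1, x, x^2, ...].
        return [x ** k for k in range(self.n)]

    def predict(self, x):
        phi = self._features(x)
        return sum(w * f for w, f in zip(self.w, phi))

    def update(self, x, y):
        """Absorb one observation (x, y) and return the prediction error
        measured before the update."""
        phi = self._features(x)
        P_phi = [sum(self.P[i][j] * phi[j] for j in range(self.n))
                 for i in range(self.n)]
        denom = 1.0 + sum(f * p for f, p in zip(phi, P_phi))
        gain = [p / denom for p in P_phi]
        error = y - self.predict(x)
        self.w = [w + g * error for w, g in zip(self.w, gain)]
        # P is symmetric, so phi^T P equals P_phi transposed.
        self.P = [[self.P[i][j] - gain[i] * P_phi[j] for j in range(self.n)]
                  for i in range(self.n)]
        return error


if __name__ == "__main__":
    model = OnlinePolynomialRegressor(degree=2)
    # Synthetic metric that grows quadratically with the iteration index.
    for iteration in range(1, 11):
        observed_metric = 2.0 + 3.0 * iteration + 0.5 * iteration ** 2
        model.update(float(iteration), observed_metric)
    print(model.predict(11.0))
```

Each call to `update` costs a fixed amount of work for a given degree, so the model can be refreshed after every completed iteration without storing or refitting past observations, which is the property an online setting relies on.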

Acknowledgments

This research was supported by the National Key R&D Program of China under Grant No. 2018YFB1004402 and by the National Natural Science Foundation of China under Grant No. 61772124.

Author information

Corresponding author

Correspondence to Yuhai Zhao.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Yue, X., Shi, L., Zhao, Y., Ji, H., Wang, G. (2021). Online Runtime Prediction Method for Distributed Iterative Jobs. In: Xing, C., Fu, X., Zhang, Y., Zhang, G., Borjigin, C. (eds) Web Information Systems and Applications. WISA 2021. Lecture Notes in Computer Science, vol 12999. Springer, Cham. https://doi.org/10.1007/978-3-030-87571-8_14

  • DOI: https://doi.org/10.1007/978-3-030-87571-8_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87570-1

  • Online ISBN: 978-3-030-87571-8

  • eBook Packages: Computer Science, Computer Science (R0)
