Statistical Learning-Based Prediction of Execution Time of Data-Intensive Program Under Hadoop2.0

  • Conference paper
Data Science (ICPCSEE 2018)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 901))

Abstract

This paper addresses the prediction of the running time of data-intensive MapReduce programs in a Hadoop2.0 environment. Although MapReduce programs are diverse, they can be divided into data-intensive and computationally intensive programs according to their time complexity and nature. Predicting the running time of computationally intensive programs has always been difficult, whereas Hadoop exhibits certain database-like attributes, and such workloads are basically data-intensive. Moreover, data-intensive programs are more closely tied to the amount of data they process and show certain statistical characteristics, so statistical learning is applied to predict their execution time. The paper first generates training data and test data according to requirements, and then selects appropriate features by analyzing the logs. Prediction was first performed with the KCCA (kernel canonical correlation analysis) algorithm, but deficiencies were found. Then, based on the characteristics of the kernel function, a prediction method based on deep learning was proposed, and the results were significant.
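The idea that execution time tracks the amount of input data can be illustrated with a very small statistical-learning sketch. The code below is not the paper's KCCA or deep-learning method; it swaps in plain ordinary least squares on a single feature (input size) as the simplest possible instance. The function names `fit_line` and `predict`, the units, and the training values are hypothetical and synthetic, chosen only for illustration.

```python
from typing import List, Tuple


def fit_line(sizes_gb: List[float], times_s: List[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of time = intercept + slope * size."""
    n = len(sizes_gb)
    if n < 2 or n != len(times_s):
        raise ValueError("need at least two (size, time) pairs of equal length")
    mean_x = sum(sizes_gb) / n
    mean_y = sum(times_s) / n
    # Sum of squared deviations of the feature, and the cross-deviation sum.
    sxx = sum((x - mean_x) ** 2 for x in sizes_gb)
    if sxx == 0:
        raise ValueError("input sizes must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes_gb, times_s))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model: Tuple[float, float], size_gb: float) -> float:
    """Predict execution time in seconds for a job with the given input size."""
    intercept, slope = model
    return intercept + slope * size_gb


# Synthetic training runs (assumed values, not measurements from the paper):
# input size in GB and observed execution time in seconds.
train_sizes = [1.0, 2.0, 4.0, 8.0]
train_times = [40.0, 70.0, 130.0, 250.0]

model = fit_line(train_sizes, train_times)
estimate = predict(model, 16.0)  # extrapolate to a larger, unseen input size
```

A real predictor would use several features extracted from the job logs rather than input size alone, which is where kernel methods such as KCCA or a neural network come in.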

Corresponding author

Correspondence to Hongzhi Wang.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Zhang, H., Li, J., Wang, H. (2018). Statistical Learning-Based Prediction of Execution Time of Data-Intensive Program Under Hadoop2.0. In: Zhou, Q., Gan, Y., Jing, W., Song, X., Wang, Y., Lu, Z. (eds) Data Science. ICPCSEE 2018. Communications in Computer and Information Science, vol 901. Springer, Singapore. https://doi.org/10.1007/978-981-13-2203-7_31

  • DOI: https://doi.org/10.1007/978-981-13-2203-7_31

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-2202-0

  • Online ISBN: 978-981-13-2203-7

  • eBook Packages: Computer Science, Computer Science (R0)
