RUPredHadoop: Resources Utilization Predictor for Hadoop with Large-Scale Clusters

  • Conference paper

Big Data (Big Data 2018)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 945))

Abstract

Apache Hadoop is a widely used distributed system in large-scale production environments. As data volumes and cluster sizes grow, its performance is limited by inappropriate resource utilization. This paper introduces a resource utilization predictor (RUPredHadoop) that predicts the utilization of CPU and memory and the read/write rates of disk and network, especially for large-scale Hadoop clusters. Based on the similarity of data and workflow in Hadoop, a pattern of resource utilization for a single task is proposed and then formalized as a single-task model. In addition, the distribution of fine-grained runtime is studied, so that a parallel-batch-tasks-based model can regenerate the whole MapReduce job by migrating the single-task model from a minimum cluster to a large-scale production cluster. With RUPredHadoop, we can locate the resource bottleneck of a Hadoop cluster and quickly configure clusters for applications with massive data. The performance of RUPredHadoop is validated on a test cluster with 35 nodes and a production cluster with 80 nodes. Results show that the normalized error is below 10% for benchmark applications with up to 100 TB of data.
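The abstract does not give the model equations, so the following is only a rough illustrative sketch of the general idea, not the authors' method. It assumes, hypothetically, that every task follows the same fixed-length utilization profile (the paper instead studies the runtime distribution), that a job with `num_tasks` tasks runs in batches of at most `slots` concurrent tasks, and that the error is a mean absolute error divided by the mean measured value. The function names and the error definition are assumptions made for illustration.

```python
import math
from typing import List


def aggregate_utilization(task_profile: List[float], num_tasks: int, slots: int) -> List[float]:
    """Compose a job-level utilization timeline from one task's profile.

    Assumption (not from the paper): all tasks share the same profile and
    duration, and tasks run in consecutive batches of at most `slots` tasks.
    """
    if slots <= 0 or num_tasks < 0:
        raise ValueError("slots must be positive and num_tasks non-negative")
    batches = math.ceil(num_tasks / slots)
    timeline = [0.0] * (batches * len(task_profile))
    for batch in range(batches):
        # The last batch may be only partially filled.
        running = min(slots, num_tasks - batch * slots)
        offset = batch * len(task_profile)
        for i, usage in enumerate(task_profile):
            timeline[offset + i] += running * usage
    return timeline


def normalized_error(predicted: List[float], measured: List[float]) -> float:
    """Mean absolute error divided by the mean measured value (hypothetical definition)."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("series must be non-empty and of equal length")
    mae = sum(abs(p - m) for p, m in zip(predicted, measured)) / len(measured)
    return mae / (sum(measured) / len(measured))


if __name__ == "__main__":
    # Five tasks on two slots: two full batches, then one task alone.
    print(aggregate_utilization([1.0, 2.0], num_tasks=5, slots=2))
    print(normalized_error([9.0, 11.0], [10.0, 10.0]))
```

Under these assumptions, a profile measured on a small cluster can be replayed for a larger one by changing only `slots`, which is the kind of migration the abstract describes at a high level.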

This paper is partially supported by the National Key Research and Development Program of China (No. 2017YFB1400300), the National Natural Science Foundation of China (No. 61573292), and the State Key Laboratory of Rail Transit Engineering Informatization (FSDI) (No. SKLK16-04).

Author information

Corresponding author

Correspondence to Fei Teng.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

Cite this paper

Ning, S., Teng, F., Li, Y., Cui, Z., Yu, L., Du, S. (2018). RUPredHadoop: Resources Utilization Predictor for Hadoop with Large-Scale Clusters. In: Xu, Z., Gao, X., Miao, Q., Zhang, Y., Bu, J. (eds) Big Data. Big Data 2018. Communications in Computer and Information Science, vol 945. Springer, Singapore. https://doi.org/10.1007/978-981-13-2922-7_32

  • DOI: https://doi.org/10.1007/978-981-13-2922-7_32

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-2921-0

  • Online ISBN: 978-981-13-2922-7

  • eBook Packages: Computer Science, Computer Science (R0)