Abstract
MapReduce is an efficient tool for data-intensive applications. Hadoop, an open-source implementation of MapReduce, has been widely adopted by enterprises and scientific computing communities. However, to run a MapReduce program efficiently in Hadoop, users have to set a number of configuration parameters, and they often run into performance problems because they do not know how to set them. To address these problems, we focus on the optimization opportunities presented by the high configurability of Hadoop and propose HCOpt, an automated tool for optimizing Hadoop configuration parameters. HCOpt uses a Profile Engine to collect monitoring information from running MapReduce programs, a Prediction Engine to estimate the performance of a given Hadoop configuration, and a genetic search algorithm to find an optimized configuration in the large search space. Our evaluation shows that HCOpt reduces the job completion time of Hadoop applications by up to 20% compared with the same applications run with the configuration parameters suggested by rule-based optimization.
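The search step described in the abstract can be illustrated with a minimal sketch. This is not HCOpt's implementation: the candidate values in `SEARCH_SPACE`, the toy cost function `predict_runtime` (a stand-in for a performance model such as the Prediction Engine), and all function names are assumptions made for illustration. Only the general idea, a genetic search that ranks candidate configurations by predicted cost, comes from the text.

```python
import random

# Parameter names are Hadoop 1.x settings; the candidate values are
# illustrative assumptions, not taken from the paper.
SEARCH_SPACE = {
    "io.sort.mb": [50, 100, 200, 400],
    "io.sort.factor": [10, 25, 50, 100],
    "mapred.reduce.tasks": [1, 4, 8, 16, 32],
    "mapred.compress.map.output": [False, True],
}


def predict_runtime(config):
    """Hypothetical cost model (lower is better); a toy stand-in only."""
    cost = 100.0
    cost += abs(config["io.sort.mb"] - 200) * 0.1
    cost += abs(config["io.sort.factor"] - 50) * 0.2
    cost += abs(config["mapred.reduce.tasks"] - 16) * 1.5
    cost -= 10.0 if config["mapred.compress.map.output"] else 0.0
    return cost


def random_config(rng):
    """Draw one configuration uniformly from the search space."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def crossover(a, b, rng):
    """Uniform crossover: each parameter is inherited from either parent."""
    return {name: (a if rng.random() < 0.5 else b)[name] for name in SEARCH_SPACE}


def mutate(config, rng, rate=0.1):
    """Resample each parameter with probability `rate`."""
    return {
        name: rng.choice(SEARCH_SPACE[name]) if rng.random() < rate else value
        for name, value in config.items()
    }


def genetic_search(cost_fn, generations=30, population_size=20, elite=4, seed=0):
    """Return the lowest-cost configuration found by a simple elitist GA."""
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=cost_fn)
        parents = population[:elite]  # elites survive unchanged
        children = []
        while len(children) < population_size - elite:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return min(population, key=cost_fn)


if __name__ == "__main__":
    best = genetic_search(predict_runtime)
    print(best, predict_runtime(best))
```

Because the elite configurations are carried over unchanged, the best predicted cost never gets worse from one generation to the next. In a real optimizer, the cost function would be backed by profiling data rather than a closed-form formula.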
Acknowledgments
This work is partly supported by the NSFC under grants No. 61433019 and No. 61370104, the International Science & Technology Cooperation Program of China under grant No. 2015DFE12860, and the Chinese Universities Scientific Fund under grant No. 2015MS077.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Zhang, X., Zeng, L., Shi, X., Wu, S., Xie, X., Jin, H. (2016). HCOpt: An Automatic Optimizer for Configuration Parameters of Hadoop. In: Zu, Q., Hu, B. (eds) Human Centered Computing. HCC 2016. Lecture Notes in Computer Science, vol 9567. Springer, Cham. https://doi.org/10.1007/978-3-319-31854-7_54
DOI: https://doi.org/10.1007/978-3-319-31854-7_54
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31853-0
Online ISBN: 978-3-319-31854-7
eBook Packages: Computer Science (R0)