Abstract
Hadoop can handle zettabyte-scale data, but its heavy demand for disk I/O and network bandwidth often limits performance. The job execution phases of Hadoop produce an enormous volume of intermediate data, and transferring that data over the network to the “reduce” process becomes a bottleneck. In this paper, we discuss an intelligent data compression policy that addresses these limitations and improves the performance of Hadoop. The policy starts compression at a suitable time, before all the map tasks in the job have completed, and thereby reduces data transfer time over the network. We evaluate the policy by running several benchmarks, which show an improvement of about 8–15% in job execution and demonstrate the merits of the proposed compression policy.
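The abstract does not spell out how the policy decides when compression pays off, so the following is only an illustrative sketch of the general compute-versus-I/O trade-off behind compressing intermediate data. The function name, its parameters, and the simple additive cost model are all assumptions made for illustration; they are not the authors' algorithm.

```python
def should_compress(data_bytes, ratio, compress_bps, decompress_bps, network_bps):
    """Estimate whether compressing intermediate data before the shuffle is faster.

    Hypothetical cost model (an assumption, not taken from the paper):
      data_bytes     -- size of the uncompressed intermediate data
      ratio          -- compressed size / original size (0 < ratio <= 1)
      compress_bps   -- compression throughput, uncompressed bytes per second
      decompress_bps -- decompression throughput, uncompressed bytes per second
      network_bps    -- available network throughput, bytes per second
    """
    # Time to ship the data as-is.
    plain_time = data_bytes / network_bps

    # Time to compress, ship the smaller payload, then decompress on the reducer.
    compressed_time = (
        data_bytes / compress_bps
        + (data_bytes * ratio) / network_bps
        + data_bytes / decompress_bps
    )
    return compressed_time < plain_time


if __name__ == "__main__":
    one_gb = 1_000_000_000
    # Slow network: transfer dominates, so compression is estimated to help.
    print(should_compress(one_gb, 0.4, 400e6, 1e9, 100e6))
    # Fast network: codec overhead dominates, so compression is estimated to hurt.
    print(should_compress(one_gb, 0.4, 400e6, 1e9, 1e9))
```

In stock Hadoop, compression of map output is switched on per job through the `mapreduce.map.output.compress` property. A policy like the one the abstract describes would additionally have to choose the moment at which to enable it, which this sketch does not model.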
Copyright information
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ashu, A., Hussain, M.W., Sinha Roy, D., Reddy, H.K. (2021). Intelligent Data Compression Policy for Hadoop Performance Optimization. In: Abraham, A., Jabbar, M., Tiwari, S., Jesus, I. (eds) Proceedings of the 11th International Conference on Soft Computing and Pattern Recognition (SoCPaR 2019). SoCPaR 2019. Advances in Intelligent Systems and Computing, vol 1182. Springer, Cham. https://doi.org/10.1007/978-3-030-49345-5_9
DOI: https://doi.org/10.1007/978-3-030-49345-5_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-49344-8
Online ISBN: 978-3-030-49345-5
eBook Packages: Intelligent Technologies and Robotics (R0)