Intelligent Data Compression Policy for Hadoop Performance Optimization

Ashu, A.; Hussain, Mir Wajahat; Sinha Roy, Diptendu; Reddy, Hemant Kumar

doi:10.1007/978-3-030-49345-5_9

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1182)

Included in the following conference series:

International Conference on Soft Computing and Pattern Recognition

Abstract

Hadoop can handle zettabyte-scale data, but its heavy demand for disk I/O and network bandwidth is often a limiting factor. The execution phases of a Hadoop job produce an enormous volume of intermediate data, and transferring that data over the network to the “reduce” process becomes a bottleneck. In this paper, we present an intelligent data compression policy to overcome these limitations and improve Hadoop performance. The policy starts compression at a suitable time, while some of the map tasks in the job have not yet completed, and thereby reduces the data transfer time over the network. The policy is evaluated on several benchmarks, which show an improvement of about 8–15% in job execution and demonstrate the merits of the proposed compression policy.
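
The abstract does not spell out the decision rule, so the following is only a minimal sketch of the general compute-versus-I/O trade-off behind such a policy, not the authors' algorithm. It assumes a simple sequential cost model (compress, send, decompress) and a hypothetical map-progress threshold `min_progress`; every name, field, and number here is an illustrative assumption rather than something taken from the paper or from Hadoop's API.

```python
from dataclasses import dataclass


@dataclass
class ShuffleEstimate:
    """Hypothetical inputs to the decision; none of these names come from the paper."""
    intermediate_bytes: float      # uncompressed map output still to be shuffled
    bandwidth_bytes_per_s: float   # usable network bandwidth towards the reducers
    compression_ratio: float       # compressed size / original size, in (0, 1]
    compress_bytes_per_s: float    # codec throughput, measured on uncompressed input
    decompress_bytes_per_s: float  # codec throughput, measured on uncompressed output


def transfer_time_plain(e: ShuffleEstimate) -> float:
    """Seconds needed to ship the intermediate data uncompressed."""
    return e.intermediate_bytes / e.bandwidth_bytes_per_s


def transfer_time_compressed(e: ShuffleEstimate) -> float:
    """Seconds to compress, ship, and decompress, assuming no overlap between steps."""
    compress = e.intermediate_bytes / e.compress_bytes_per_s
    send = e.intermediate_bytes * e.compression_ratio / e.bandwidth_bytes_per_s
    decompress = e.intermediate_bytes / e.decompress_bytes_per_s
    return compress + send + decompress


def should_compress(e: ShuffleEstimate, maps_done: int, maps_total: int,
                    min_progress: float = 0.5) -> bool:
    """Decide whether to switch compression on at this point in the job.

    Assumed rule: act only while some map tasks are still running, wait until
    a fraction `min_progress` of them has finished, and compress only if the
    estimated compressed path is faster than the plain one.
    """
    if maps_total <= 0 or maps_done >= maps_total:
        return False  # nothing left to gain once every map task has finished
    if maps_done / maps_total < min_progress:
        return False  # too early: not enough map tasks have finished yet
    return transfer_time_compressed(e) < transfer_time_plain(e)


if __name__ == "__main__":
    # Illustrative numbers only: 1 GB of map output on a ~100 MB/s link.
    slow_link = ShuffleEstimate(1e9, 100e6, 0.4, 500e6, 1000e6)
    print(should_compress(slow_link, maps_done=6, maps_total=10))
```

Because the model adds the three steps without overlap, it is pessimistic about compression; a real cluster pipelines them, so a sketch like this would tend to underestimate the benefit.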

Author information

Authors and Affiliations

Department of Computer Science and Engineering, SRM University AP, Hyderabad, India
A. Ashu
Department of Computer Science and Engineering, National Institute of Technology, Meghalaya, Shillong, Meghalaya, India
Mir Wajahat Hussain & Diptendu Sinha Roy
School of Computer Science, National Institute of Science and Technology Berhampur, Berhampur, Odisha, India
Hemant Kumar Reddy

Corresponding author

Correspondence to A. Ashu.

Editor information

Editors and Affiliations

Scientific Network for Innovation and Research Excellence, Machine Intelligence Research Labs (MIR), Auburn, WA, USA
Ajith Abraham
Department of Computer Science and Engineering, Vardhaman College of Engineering, Hyderabad, Telangana, India
M. A. Jabbar
Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid, Madrid, Spain
Sanju Tiwari
ISEP - Instituto Superior de Engenharia do Porto, Porto, Portugal
Isabel M. S. Jesus

About this paper

Cite this paper

Ashu, A., Hussain, M.W., Sinha Roy, D., Reddy, H.K. (2021). Intelligent Data Compression Policy for Hadoop Performance Optimization. In: Abraham, A., Jabbar, M., Tiwari, S., Jesus, I. (eds) Proceedings of the 11th International Conference on Soft Computing and Pattern Recognition (SoCPaR 2019). SoCPaR 2019. Advances in Intelligent Systems and Computing, vol 1182. Springer, Cham. https://doi.org/10.1007/978-3-030-49345-5_9

DOI: https://doi.org/10.1007/978-3-030-49345-5_9
Published: 01 August 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-49344-8
Online ISBN: 978-3-030-49345-5
eBook Packages: Intelligent Technologies and Robotics (R0)
