Intelligent Data Compression Policy for Hadoop Performance Optimization

  • Conference paper
Proceedings of the 11th International Conference on Soft Computing and Pattern Recognition (SoCPaR 2019)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1182)

Abstract

Hadoop can handle zettabyte-scale data, but heavy demand on disk I/O and network bandwidth often limits its performance. During the different execution phases of a Hadoop job, an enormous amount of intermediate data is produced, and transferring that data over the network to the “reduce” process becomes a bottleneck. In this paper, we discuss an intelligent data compression policy to overcome these limitations and improve the performance of Hadoop. The policy starts compression at an apt time, before all the map tasks in the job have completed, which reduces data transfer time over the network. The policy is evaluated by running several benchmarks, which show an improvement of about 8–15% during job execution and demonstrate the merits of the proposed compression policy.
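The trade-off the abstract alludes to, spending CPU time on compression to save network transfer time, can be illustrated with a rough cost model. The sketch below is not the policy proposed in the paper, whose details are not given in the abstract, and it does not model when compression starts relative to map-task completion. All names, fields, and numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ShuffleEstimate:
    """Hypothetical inputs to the cost model; none of these come from the paper."""
    intermediate_bytes: float      # uncompressed map output still to be transferred
    network_bytes_per_s: float     # bandwidth available for the transfer
    compress_bytes_per_s: float    # codec throughput, in uncompressed bytes per second
    decompress_bytes_per_s: float  # codec throughput, in uncompressed bytes per second
    compression_ratio: float       # compressed size / original size, in (0, 1]


def transfer_time_plain(e: ShuffleEstimate) -> float:
    """Seconds needed to send the intermediate data uncompressed."""
    return e.intermediate_bytes / e.network_bytes_per_s


def transfer_time_compressed(e: ShuffleEstimate) -> float:
    """Seconds needed to compress, send the smaller payload, and decompress.

    Assumes the three steps run one after another (no pipelining),
    which is a pessimistic simplification.
    """
    compress = e.intermediate_bytes / e.compress_bytes_per_s
    send = e.intermediate_bytes * e.compression_ratio / e.network_bytes_per_s
    decompress = e.intermediate_bytes / e.decompress_bytes_per_s
    return compress + send + decompress


def should_compress(e: ShuffleEstimate) -> bool:
    """Compress only when the estimated end-to-end time is lower."""
    return transfer_time_compressed(e) < transfer_time_plain(e)


if __name__ == "__main__":
    # Illustrative numbers only: 10 GB of intermediate data on a slow link.
    slow_link = ShuffleEstimate(10e9, 100e6, 400e6, 800e6, 0.4)
    # The same data and codec on a link ten times faster.
    fast_link = ShuffleEstimate(10e9, 1e9, 400e6, 800e6, 0.4)
    print(should_compress(slow_link), should_compress(fast_link))
```

Under these assumed numbers, compression pays off on the slow link but not on the fast one, which is why a fixed always-compress setting may not suit every cluster.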

Corresponding author

Correspondence to A. Ashu.

Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ashu, A., Hussain, M.W., Sinha Roy, D., Reddy, H.K. (2021). Intelligent Data Compression Policy for Hadoop Performance Optimization. In: Abraham, A., Jabbar, M., Tiwari, S., Jesus, I. (eds) Proceedings of the 11th International Conference on Soft Computing and Pattern Recognition (SoCPaR 2019). SoCPaR 2019. Advances in Intelligent Systems and Computing, vol 1182. Springer, Cham. https://doi.org/10.1007/978-3-030-49345-5_9
