Optimization Analysis of Hadoop

  • Conference paper
Social Computing (ICYCSEE 2016)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 623))

Abstract

Hadoop is a distributed data processing platform that supports the MapReduce parallel computing framework. Because Hadoop is designed to handle general problems, it often needs to be accelerated under particular circumstances, such as Hive jobs. By writing the current time to the logs at specially selected points, we traced the workflow of a typical MapReduce job generated by Hive and collected timing statistics for every phase of the job. Using different data volumes, we compared the proportion of time spent in each phase and located the bottlenecks of Hadoop. We make two major optimization recommendations: (1) for big jobs with a large number of intermediate results, focus on using a combiner and on optimizing network and disk I/O; (2) for short jobs, focus on optimizing the map function and disk I/O.
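The timing approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' instrumentation: the log line format (`<timestamp> PHASE <name> <START|END>`), the phase names, and the sample values are all hypothetical and chosen only to show how per-phase durations and their proportions could be derived from timestamps written at selected points.

```python
from datetime import datetime

# Hypothetical log excerpt; the real instrumentation points and
# log format used in the paper are not given in this preview.
SAMPLE_LOG = """\
2016-03-24T10:00:00.000 PHASE map START
2016-03-24T10:00:12.500 PHASE map END
2016-03-24T10:00:12.500 PHASE shuffle START
2016-03-24T10:00:20.000 PHASE shuffle END
2016-03-24T10:00:20.000 PHASE reduce START
2016-03-24T10:00:25.000 PHASE reduce END
"""

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def phase_durations(log_text):
    """Return {phase_name: seconds} from matching START/END markers."""
    starts = {}
    durations = {}
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 4 or parts[1] != "PHASE":
            continue  # skip log lines that are not phase markers
        stamp = datetime.strptime(parts[0], TIME_FORMAT)
        name, event = parts[2], parts[3]
        if event == "START":
            starts[name] = stamp
        elif event == "END" and name in starts:
            elapsed = (stamp - starts.pop(name)).total_seconds()
            durations[name] = durations.get(name, 0.0) + elapsed
    return durations


def phase_proportions(durations):
    """Return each phase's share of the total measured time."""
    total = sum(durations.values())
    if total == 0:
        return {}
    return {name: secs / total for name, secs in durations.items()}


if __name__ == "__main__":
    measured = phase_durations(SAMPLE_LOG)
    for name, share in phase_proportions(measured).items():
        print(f"{name}: {measured[name]:.1f} s ({share:.0%})")
```

Running the same calculation on logs from jobs of different input sizes would show how the share of each phase shifts, which is the kind of comparison the abstract uses to locate bottlenecks.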

Acknowledgement

This paper was partially supported by National Sci-Tech Support Plan 2015BAH10F01; NSFC grants U1509216, 61472099, and 61133002; and the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province LC2016026.

Author information

Corresponding author

Correspondence to Jinglun Li.

Copyright information

© 2016 Springer Science+Business Media Singapore

About this paper

Cite this paper

Li, J., Shi, S., Wang, H. (2016). Optimization Analysis of Hadoop. In: Che, W., et al. Social Computing. ICYCSEE 2016. Communications in Computer and Information Science, vol 623. Springer, Singapore. https://doi.org/10.1007/978-981-10-2053-7_46

  • DOI: https://doi.org/10.1007/978-981-10-2053-7_46

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-2052-0

  • Online ISBN: 978-981-10-2053-7

  • eBook Packages: Computer Science, Computer Science (R0)
