Optimizing data placement in heterogeneous Hadoop clusters

Cluster Computing

Abstract

The data placement decisions of the Hadoop distributed file system (HDFS) are critical to data locality, which is a primary criterion for task scheduling in the MapReduce model and ultimately affects application performance. The existing rack-aware data placement strategy and replication scheme of HDFS work well with the MapReduce framework in homogeneous Hadoop clusters, but in practice such a placement policy can noticeably reduce MapReduce performance and may increase energy dissipation in heterogeneous environments. In addition, HDFS applies a fixed default replication factor to every data block, which wastes storage space when the Hadoop system holds a large amount of inactive data. In this paper, we propose a novel data placement strategy (SLDP) for heterogeneous Hadoop clusters. SLDP first uses a heterogeneity-aware algorithm to divide the nodes into several virtual storage tiers (VSTs), and then places data blocks across the nodes in each VST circuitously according to the hotness of the data. Furthermore, SLDP uses hotness-proportional replication to save disk space and provides an effective power control function. Experimental results on two real data-intensive applications show that SLDP is energy-efficient and space-saving and can significantly improve MapReduce performance in a heterogeneous Hadoop cluster.
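The abstract names three ideas: tiering nodes by capability, placing blocks by hotness, and scaling the replica count with hotness. It does not give the algorithms themselves, so the Python sketch below is only an illustration of how such ideas could fit together, not the authors' SLDP implementation. Every name in it (`build_tiers`, `replica_count`, `place_blocks`), the single numeric capability score per node, the hotness scale of 0 to 1, the equal-sized tiers, and the replica bounds of 1 and 3 are assumptions made for the example. Power control is not modeled.

```python
from itertools import cycle
from math import ceil


def build_tiers(node_scores, num_tiers):
    """Rank nodes by an assumed capability score and cut the ranking into
    equal-sized tiers. Tier 0 holds the most capable nodes."""
    ranked = sorted(node_scores, key=node_scores.get, reverse=True)
    size = ceil(len(ranked) / num_tiers)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


def replica_count(hotness, min_replicas=1, max_replicas=3):
    """Scale the replica count with hotness in [0, 1], clamped to the
    assumed bounds, so that cold blocks keep fewer copies."""
    return max(min_replicas, min(max_replicas, ceil(hotness * max_replicas)))


def place_blocks(block_hotness, tiers):
    """Map each block to a tier by hotness (hotter blocks go to lower-numbered
    tiers), then hand out that tier's nodes in rotation."""
    cursors = [cycle(tier) for tier in tiers]
    placement = {}
    # Handle the hottest blocks first so they get the first slots in a tier.
    for block, hotness in sorted(block_hotness.items(), key=lambda kv: -kv[1]):
        tier_idx = min(int((1.0 - hotness) * len(tiers)), len(tiers) - 1)
        # Never ask for more replicas than the tier has distinct nodes.
        wanted = min(replica_count(hotness), len(tiers[tier_idx]))
        placement[block] = [next(cursors[tier_idx]) for _ in range(wanted)]
    return placement


if __name__ == "__main__":
    scores = {"n1": 9, "n2": 8, "n3": 5, "n4": 4, "n5": 2, "n6": 1}
    tiers = build_tiers(scores, 3)
    print(place_blocks({"a": 0.9, "b": 0.5, "c": 0.1}, tiers))
```

In this sketch a hot block lands on the most capable tier with as many replicas as that tier allows, while a cold block gets a single copy on the least capable tier, which is one plausible reading of how hotness-proportional replication could save disk space.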

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant Nos. 61320106007, 61202449, 61572129, 61502097, and 61370207, the National High-tech R&D Program of China (863 Program) under Grant No. 2013AA013503, the China Fundamental Research Funds for the Central Universities under Grant No. 1109007115, the Jiangsu research prospective joint research project under Grant Nos. BY2012202 and BY2013073-01, the Jiangsu Provincial Key Laboratory of Network and Information Security under Grant No. BM2003201, and the Key Laboratory of Computer Network and Information Integration of the Ministry of Education of China under Grant No. 93K-9, and is partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization.

Author information

Corresponding author

Correspondence to Runqun Xiong.

About this article

Cite this article

Xiong, R., Luo, J. & Dong, F. Optimizing data placement in heterogeneous Hadoop clusters. Cluster Comput 18, 1465–1480 (2015). https://doi.org/10.1007/s10586-015-0495-z

  • DOI: https://doi.org/10.1007/s10586-015-0495-z
