Optimizing data placement in heterogeneous Hadoop clusters

Xiong, Runqun; Luo, Junzhou; Dong, Fang

doi:10.1007/s10586-015-0495-z

Published: 01 October 2015

Volume 18, pages 1465–1480, (2015)

Cluster Computing

Abstract

Data placement decisions in the Hadoop distributed file system (HDFS) are very important for data locality, which is a primary criterion for task scheduling in the MapReduce model and ultimately affects application performance. The existing rack-aware data placement strategy and replication scheme of HDFS work well with the MapReduce framework in homogeneous Hadoop clusters, but in practice such a data placement policy can noticeably reduce MapReduce performance and may increase energy dissipation in heterogeneous environments. In addition, HDFS applies a fixed default replication factor to every data block, which wastes storage space when the Hadoop system holds a lot of inactive data. In this paper, we propose a novel data placement strategy (SLDP) for heterogeneous Hadoop clusters. SLDP first uses a heterogeneity-aware algorithm to divide the nodes into several virtual storage tiers (VSTs), and then places data blocks across the nodes in each VST in a circular fashion according to the hotness of the data. Furthermore, SLDP uses hotness-proportional replication to save disk space and also provides an effective power control function. Experimental results on two real data-intensive applications show that SLDP is energy-efficient, space-saving, and able to significantly improve MapReduce performance in a heterogeneous Hadoop cluster.
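
The abstract does not give SLDP's algorithms, so the following is only a minimal sketch of the three ideas it names: tiering nodes by capability, placing blocks within a tier according to hotness, and hotness-proportional replication. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `capability` score, the equal-size tier split, the linear mapping from hotness to replica count, the round-robin reading of placement within a tier, and all function names. Real HDFS placement also spreads replicas across racks for fault tolerance, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    name: str
    capability: float  # hypothetical score, e.g. relative I/O throughput


def split_into_tiers(nodes: List[Node], num_tiers: int) -> List[List[Node]]:
    """Rank nodes by capability (fastest first) and cut the ranking into
    contiguous tiers of near-equal size. Assumed rule, not the paper's."""
    if num_tiers < 1:
        raise ValueError("num_tiers must be at least 1")
    ranked = sorted(nodes, key=lambda n: n.capability, reverse=True)
    size, extra = divmod(len(ranked), num_tiers)
    tiers, start = [], 0
    for i in range(num_tiers):
        end = start + size + (1 if i < extra else 0)
        tiers.append(ranked[start:end])
        start = end
    return tiers


def replica_count(hotness: float, min_replicas: int = 1, max_replicas: int = 3) -> int:
    """Map hotness in [0, 1] linearly onto a replica count (assumed mapping)."""
    h = min(max(hotness, 0.0), 1.0)
    return min_replicas + round(h * (max_replicas - min_replicas))


def place_block(hotness: float, tiers: List[List[Node]], cursors: List[int]) -> List[str]:
    """Pick a tier from hotness (hot data goes to the fastest tier), then
    assign replicas to that tier's nodes in round-robin order."""
    h = min(max(hotness, 0.0), 1.0)
    t = min(int((1.0 - h) * len(tiers)), len(tiers) - 1)
    tier = tiers[t]
    if not tier:
        raise ValueError("selected tier has no nodes")
    k = min(replica_count(h), len(tier))  # never more replicas than nodes
    chosen = [tier[(cursors[t] + i) % len(tier)].name for i in range(k)]
    cursors[t] = (cursors[t] + k) % len(tier)  # advance the tier's cursor
    return chosen


if __name__ == "__main__":
    cluster = [Node("a", 4.0), Node("b", 3.0), Node("c", 2.0), Node("d", 1.0)]
    tiers = split_into_tiers(cluster, 2)
    cursors = [0] * len(tiers)
    print(place_block(1.0, tiers, cursors))  # hot block, fast tier
    print(place_block(0.0, tiers, cursors))  # cold block, slow tier
```

In this sketch a cold block gets a single replica on the slowest tier, which is where the space saving would come from; a power control function could then, in principle, target nodes that hold only cold data, though how SLDP does this is not stated in the abstract.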

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant Nos. 61320106007, 61202449, 61572129, 61502097, 61370207, the National High-tech R&D Program of China (863 Program) under Grant No. 2013AA013503, the China Fundamental Research Funds for the Central Universities under Grant No. 1109007115, the Jiangsu research prospective joint research project under Grant Nos. BY2012202, BY2013073-01, the Jiangsu Provincial Key Laboratory of Network and Information Security under Grant No. BM2003201, and the Key Laboratory of Computer Network and Information Integration of Ministry of Education of China under Grant No. 93K-9, and is partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization.

Author information

Authors and Affiliations

School of Computer Science and Engineering, Southeast University, Nanjing, 211189, People’s Republic of China
Runqun Xiong, Junzhou Luo & Fang Dong

Corresponding author

Correspondence to Runqun Xiong.

About this article

Cite this article

Xiong, R., Luo, J. & Dong, F. Optimizing data placement in heterogeneous Hadoop clusters. Cluster Comput 18, 1465–1480 (2015). https://doi.org/10.1007/s10586-015-0495-z

Received: 13 March 2015
Revised: 17 September 2015
Accepted: 19 September 2015
Published: 01 October 2015
Issue Date: December 2015
DOI: https://doi.org/10.1007/s10586-015-0495-z
