Research article
DOI: 10.1145/3368235.3368843

Preventing Data Popularity Concentration in HDFS based Cloud Storage

Published: 02 December 2019

Abstract

The Hadoop Distributed File System (HDFS) often develops skew in data storage over time, mainly because of its random data block allocation policy, datanode failures, replica reconstruction, and client activity, leading to utilization and load imbalance in the system. Although HDFS provides a tool to rebalance data in the cluster, the balancer considers only disk space utilization, re-allocating data from highly utilized nodes to less utilized ones. Thus, the default HDFS balancer does not address data access skew, which arises when a large amount of popular data piles up on one node. To address this issue, we present a popularity-aware balancer based on a node popularity score, which spreads popular data uniformly among datanodes, balancing future access load and reducing hot spots in the cloud storage system. Simulation results demonstrate the promising benefits of the proposed popularity-aware balancer, evaluated by how uniformly popular data is distributed across nodes, without compromising the amount of data transferred or the variance in disk space.
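The abstract does not define the node popularity score or the balancing procedure, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes a node's popularity score is the sum of access counts of the blocks it stores, and it greedily moves one block at a time from the most popular node to the least popular one. The names `Block`, `DataNode`, and `rebalance_by_popularity` are hypothetical, and the sketch ignores disk space limits and replica placement constraints that a real HDFS balancer must respect.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    block_id: str
    size_mb: int
    accesses: int  # assumed popularity measure: recent access count


@dataclass
class DataNode:
    name: str
    blocks: list = field(default_factory=list)

    def popularity(self) -> int:
        # Assumed node popularity score: total accesses of stored blocks.
        return sum(b.accesses for b in self.blocks)


def rebalance_by_popularity(nodes, max_moves=100):
    """Greedily move blocks from the hottest node to the coldest node.

    Returns a list of (block_id, source_node, target_node) moves.
    """
    moves = []
    for _ in range(max_moves):
        hot = max(nodes, key=lambda n: n.popularity())
        cold = min(nodes, key=lambda n: n.popularity())
        gap = hot.popularity() - cold.popularity()
        # Moving a block with `a` accesses changes the gap between the two
        # nodes to |gap - 2a|, so only blocks with 0 < a < gap shrink it.
        candidates = [b for b in hot.blocks if 0 < b.accesses < gap]
        if not candidates:
            break
        # Prefer the block that leaves the two nodes closest to equal.
        best = min(candidates, key=lambda b: abs(gap - 2 * b.accesses))
        hot.blocks.remove(best)
        cold.blocks.append(best)
        moves.append((best.block_id, hot.name, cold.name))
    return moves
```

Each move strictly reduces the sum of squared node scores, so the loop terminates even without the `max_moves` cap. A popularity-only rule like this can worsen disk space variance, which is why a practical balancer would also have to bound data transfer and disk utilization, the two quantities the abstract says the proposed balancer does not compromise.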

Cited By

  • (2024) Towards an Intelligent Framework for Scientific Computational Steering in Big Data Systems. 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid), pp. 671-675. DOI: 10.1109/CCGrid59990.2024.00085. Online publication date: 6-May-2024
  • (2024) On an Approximation Algorithm Combined with D3QN for HDFS Data Block Recovery in Heterogeneous Hadoop Clusters. Intelligent Systems and Applications, pp. 381-401. DOI: 10.1007/978-3-031-66329-1_25. Online publication date: 31-Jul-2024
  • (2021) Research on Automatic Online Analysis Method of Data Hotness in Big Data Scenario. 2021 3rd International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI), pp. 350-354. DOI: 10.1109/MLBDBI54094.2021.00072. Online publication date: Dec-2021

    Published In

    UCC '19 Companion: Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing Companion
    December 2019
    193 pages
    ISBN:9781450370448
    DOI:10.1145/3368235

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. balancing
    2. data popularity
    3. data storage
    4. hdfs

    Qualifiers

    • Research-article

    Conference

    UCC '19
    Acceptance Rates

    Overall Acceptance Rate 38 of 125 submissions, 30%
