DOI: 10.1145/3341105.3374044

Efficient scheme for compressing and transferring data in Hadoop clusters

Published: 30 March 2020

Abstract

The volume of data collected by public institutions and industry is growing explosively. As the data to be processed grows, there is a limit to handling big data with scale-up servers alone. To address this limitation, distributed cluster computing systems built on scale-out servers have emerged. However, unless network bandwidth is used efficiently, such systems cannot fully exploit the performance of scale-out servers. In this paper, we propose an efficient scheme for compressing and transferring data in Hadoop clusters. The proposed method selects an appropriate compression algorithm by evaluating a data transfer cost model based on the information entropy of the data and the network bandwidth. Experimental results show that the proposed scheme significantly reduces both the data transfer time and the amount of data transferred between data nodes.
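The abstract describes choosing between compressed and raw transfer from a cost model driven by data entropy and network bandwidth. As a rough illustration only (the paper's actual cost model and parameters are not reproduced here), the following Python sketch estimates Shannon entropy from a byte sample and compares an assumed compressed-path cost against raw transfer. The function names, the entropy/8 compression-ratio estimate, and the single-throughput compression term are all assumptions for the sketch, not the authors' method.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte sample, in bits per byte (0 = fully repetitive, 8 = random)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

def choose_transfer(size_bytes: int, sample: bytes,
                    bandwidth_Bps: float, compress_Bps: float) -> str:
    """Pick 'compressed' or 'raw' by comparing estimated transfer costs.

    The sample's entropy gives an optimistic compression-ratio estimate
    (entropy / 8 bits per byte); the compressed path pays a compression
    cost plus the transfer of the smaller payload.
    """
    ratio = shannon_entropy(sample) / 8.0            # expected compressed/raw size
    raw_cost = size_bytes / bandwidth_Bps            # send everything uncompressed
    comp_cost = size_bytes / compress_Bps + (size_bytes * ratio) / bandwidth_Bps
    return "compressed" if comp_cost < raw_cost else "raw"
```

Under this toy model, low-entropy data on a slow link favors compression, while high-entropy (already-compressed or random) data on a fast link favors raw transfer, which matches the intuition the abstract appeals to.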


Cited By

  • (2023) ALP: Adaptive Lossless floating-Point Compression. Proceedings of the ACM on Management of Data 1, 4 (Dec. 2023), 1-26. DOI: 10.1145/3626717



    Published In

    SAC '20: Proceedings of the 35th Annual ACM Symposium on Applied Computing
    March 2020
    2348 pages
    ISBN:9781450368667
    DOI:10.1145/3341105


Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. Hadoop
    2. data compression
    3. network bandwidth

    Qualifiers

    • Research-article

    Funding Sources

    • National Research Foundation of Korea (NRF), funded by the Ministry of Science and ICT

    Conference

    SAC '20: The 35th ACM/SIGAPP Symposium on Applied Computing
    March 30 - April 3, 2020
    Brno, Czech Republic

    Acceptance Rates

    Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

