DOI: 10.1145/3341105.3374044

Efficient scheme for compressing and transferring data in Hadoop clusters

Published: 30 March 2020

Abstract

The volume of data collected by public institutions and industry is growing explosively. As the data to be processed grows, there is a limit to handling big data with scale-up servers alone. To address this limitation, distributed cluster computing systems built on scale-out servers have emerged. However, unless network bandwidth is used efficiently, such systems cannot fully exploit the performance of scale-out servers. In this paper, we propose an efficient scheme for compressing and transferring data in Hadoop clusters. The proposed method selects an appropriate compression algorithm by evaluating a data transfer cost model based on the information entropy of the data and the network bandwidth. Experimental results show that the proposed scheme significantly reduces both the data transfer time and the amount of data transferred between data nodes.
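The abstract describes choosing between compressed and raw transfer from a cost model driven by data entropy and network bandwidth. As a rough illustration only (the paper's actual cost model and parameters are not reproduced here), the following Python sketch estimates Shannon entropy from a byte sample and compares an assumed compressed-path cost against raw transfer. The function names, the entropy/8 compression-ratio estimate, and the single-throughput compression term are all assumptions for the sketch, not the authors' method.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte sample, in bits per byte (0 = fully repetitive, 8 = random)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

def choose_transfer(size_bytes: int, sample: bytes,
                    bandwidth_Bps: float, compress_Bps: float) -> str:
    """Pick 'compressed' or 'raw' by comparing estimated transfer costs.

    The sample's entropy gives an optimistic compression-ratio estimate
    (entropy / 8 bits per byte); the compressed path pays a compression
    cost plus the transfer of the smaller payload.
    """
    ratio = shannon_entropy(sample) / 8.0            # expected compressed/raw size
    raw_cost = size_bytes / bandwidth_Bps            # send everything uncompressed
    comp_cost = size_bytes / compress_Bps + (size_bytes * ratio) / bandwidth_Bps
    return "compressed" if comp_cost < raw_cost else "raw"
```

Under this toy model, low-entropy data on a slow link favors compression, while high-entropy (already-compressed or random) data on a fast link favors raw transfer, which matches the intuition the abstract appeals to.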


Cited By

  • (2023) ALP: Adaptive Lossless floating-Point Compression. Proceedings of the ACM on Management of Data 1, 4 (Dec. 2023), 1-26. DOI: 10.1145/3626717



    Published In

    SAC '20: Proceedings of the 35th Annual ACM Symposium on Applied Computing
    March 2020
    2348 pages
    ISBN:9781450368667
    DOI:10.1145/3341105


Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. Hadoop
    2. data compression
    3. network bandwidth

    Qualifiers

    • Research-article

    Funding Sources

    • National Research Foundation of Korea (NRF), funded by the Ministry of Science and ICT

    Conference

    SAC '20: The 35th ACM/SIGAPP Symposium on Applied Computing
    March 30 - April 3, 2020
    Brno, Czech Republic

    Acceptance Rates

    Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

