Abstract
The Hadoop framework is widely deployed on commodity-hardware clusters to build large, scalable, powerful systems for massive data processing. The Hadoop Distributed File System (HDFS), Hadoop's distributed storage component, is responsible for managing vast amounts of data effectively across large clusters. To exploit Hadoop's parallel processing framework, MapReduce, the traditional workflow must first upload data from local file systems to HDFS. Unfortunately, when dealing with massive data, this uploading step becomes extremely time-consuming, causing nearly intolerable delays for urgent tasks, and it wastes space through replicated data. The primary contribution of this paper is the proposal of Zput and its supplementary mechanism, Zport. After describing the implementation, we introduce several refinements that are significant for runtime efficiency and performance. Evaluation results show that Zput accelerates local data uploading by over 315.4%, while Zport boosts remote block distribution by over 190.3%; compatibility with upper-layer applications remains intact.
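The conventional upload step the paper sets out to accelerate can be illustrated with the standard HDFS shell; the paths below are illustrative, and running these commands requires a live Hadoop cluster:

```
# Copy a local dataset into HDFS: each block is checksummed,
# streamed to a DataNode, and then replicated (3x by default) --
# the time and space overhead that Zput and Zport target.
hdfs dfs -put /local/dataset /user/hadoop/dataset

# Confirm that the files arrived in HDFS.
hdfs dfs -ls /user/hadoop/dataset
```

`hdfs dfs -put` is the stock client-side copy path; the paper's point is that for data already resident on cluster nodes, this full copy-and-replicate pipeline is largely avoidable.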
Acknowledgments
This work was supported by the following programs: (1) the National High-Tech Research and Development Program of China (2013AA013204); (2) the National HeGaoJi Key Project under grant number 2013ZX01039-002-001-001; and (3) the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030200).
Cite this article
Wang, Y., Ma, C., Wang, W. et al. An approach of fast data manipulation in HDFS with supplementary mechanisms. J Supercomput 71, 1736–1753 (2015). https://doi.org/10.1007/s11227-014-1287-6