An approach of fast data manipulation in HDFS with supplementary mechanisms

The Journal of Supercomputing

Abstract

The Hadoop framework has been widely deployed on clusters of many kinds to build large, scalable, and powerful systems for massive data processing on commodity hardware. The Hadoop distributed file system (HDFS), the distributed storage component of Hadoop, is responsible for managing vast amounts of data effectively in large clusters. To use Map/Reduce, the parallel processing infrastructure of Hadoop, the traditional workflow must first upload data from local file systems to HDFS. Unfortunately, with massive data this upload becomes extremely time-consuming, which causes an almost intolerable delay for urgent tasks and wastes space on replicated data. The primary contribution of this paper is Zput and its supplementary mechanism, Zport. After describing the implementation, we introduce several refinements that are significant for runtime efficiency and performance. Evaluation results show that Zput accelerates local data uploading by over 315.4%, while Zport boosts remote block distribution by over 190.3%. Compatibility with upper-layer applications remains intact.

Acknowledgments

This work is supported by the following programs: (1) the National High-Tech Research and Development Program of China (2013AA013204); (2) the National HeGaoJi Key Project under grant numbered 2013ZX01039-002-001-001; and (3) the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030200).

Author information

Corresponding author

Correspondence to Youwei Wang.

About this article

Cite this article

Wang, Y., Ma, C., Wang, W. et al. An approach of fast data manipulation in HDFS with supplementary mechanisms. J Supercomput 71, 1736–1753 (2015). https://doi.org/10.1007/s11227-014-1287-6

  • DOI: https://doi.org/10.1007/s11227-014-1287-6
