Abstract
The Hadoop framework is widely deployed on commodity-hardware clusters to build large, scalable, powerful systems for massive data processing. The Hadoop Distributed File System (HDFS), Hadoop's distributed storage component, is responsible for managing vast amounts of data effectively across large clusters. To exploit Hadoop's parallel processing framework, MapReduce, the traditional workflow must first upload data from local file systems to HDFS. Unfortunately, when dealing with massive data, this uploading step becomes extremely time-consuming, causing nearly intolerable delays for urgent tasks, and it wastes space through replicated data. The primary contribution of this paper is the proposal of Zput and its supplementary mechanism, Zport. After describing the implementation, we introduce several refinements that are significant for runtime efficiency and performance. Evaluation results show that Zput accelerates local data uploading by over 315.4%, while Zport boosts remote block distribution by over 190.3%; compatibility with upper-layer applications remains intact.
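The conventional upload step the paper sets out to accelerate can be illustrated with the standard HDFS shell; the paths below are illustrative, and running these commands requires a live Hadoop cluster:

```
# Copy a local dataset into HDFS: each block is checksummed,
# streamed to a DataNode, and then replicated (3x by default) --
# the time and space overhead that Zput and Zport target.
hdfs dfs -put /local/dataset /user/hadoop/dataset

# Confirm that the files arrived in HDFS.
hdfs dfs -ls /user/hadoop/dataset
```

`hdfs dfs -put` is the stock client-side copy path; the paper's point is that for data already resident on cluster nodes, this full copy-and-replicate pipeline is largely avoidable.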
Acknowledgments
This work was supported by the following programs: (1) the National High-Tech Research and Development Program of China (2013AA013204); (2) the National HeGaoJi Key Project under grant number 2013ZX01039-002-001-001; and (3) the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030200).
Cite this article
Wang, Y., Ma, C., Wang, W. et al. An approach of fast data manipulation in HDFS with supplementary mechanisms. J Supercomput 71, 1736–1753 (2015). https://doi.org/10.1007/s11227-014-1287-6