An Overview on Data Deduplication Techniques

Zhang, Xuecheng; Deng, Mingzhu

doi:10.1007/978-3-319-38771-0_35


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 455)


Abstract

Massive data volumes place ever higher demands on the capacity of storage devices, but in practice capacity growth lags far behind data growth. Among data reduction techniques, deduplication has come to the fore because of its high efficiency, low resource consumption, and wide range of applications. Data deduplication means finding and eliminating redundant data in a storage system. In a local storage system, only one copy of each data object needs to be stored, which saves limited storage space; in a networked system, deduplication saves storage space and also reduces the bandwidth needed for transmission, which raises the effective transmission rate. It is a trade-off that achieves efficient storage at the cost of computational overhead. This article introduces data deduplication techniques, describes their basic principles and processes, summarizes the main techniques in current research, and offers recommendations for future development.
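
As a rough illustration of the principle described in the abstract (keep one copy of each redundant piece of data and pay for it with extra computation), the following minimal sketch splits data into fixed-size chunks and stores each distinct chunk once, keyed by its fingerprint. It is not taken from the chapter: the class name DedupStore, its methods, the fixed-size chunking, and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical): each unique chunk is kept once."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size  # assumed fixed chunk size, in bytes
        self.chunks = {}              # fingerprint -> chunk bytes
        self.files = {}               # file name -> ordered list of fingerprints

    def put(self, name, data):
        """Split data into chunks and store only the chunks not seen before."""
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            # Computing the fingerprint is the computational overhead
            # traded for storage savings.
            fingerprint = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fingerprint, chunk)
            recipe.append(fingerprint)
        self.files[name] = recipe

    def get(self, name):
        """Rebuild a file from its list of chunk fingerprints."""
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def stored_bytes(self):
        """Total size of the unique chunks actually kept."""
        return sum(len(chunk) for chunk in self.chunks.values())
```

With a chunk size of 4 bytes, storing b"AAAABBBBCCCC" and then b"AAAABBBBDDDD" keeps four unique chunks instead of six, because the two files share their first two chunks. The same fingerprints could in principle be exchanged over a network so that only missing chunks are transmitted, which corresponds to the bandwidth saving the abstract mentions.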



Author information

Authors and Affiliations

National University of Defense Technology, Changsha, China
Xuecheng Zhang & Mingzhu Deng


Corresponding author

Correspondence to Xuecheng Zhang.

Editor information

Editors and Affiliations

Department of Automation and Applied Informatics, Faculty of Engineering, Aurel Vlaicu University of Arad, Arad, Romania
Valentina Emilia Balas
Bournemouth University, Poole, UK, and University of South Australia, Adelaide, Australia
Lakhmi C. Jain
School of Information Engineering, Chang'an University, Xi'an, China
Xiangmo Zhao

About this paper

Cite this paper

Zhang, X., Deng, M. (2017). An Overview on Data Deduplication Techniques. In: Balas, V., Jain, L., Zhao, X. (eds) Information Technology and Intelligent Transportation Systems. Advances in Intelligent Systems and Computing, vol 455. Springer, Cham. https://doi.org/10.1007/978-3-319-38771-0_35


DOI: https://doi.org/10.1007/978-3-319-38771-0_35
Published: 06 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-38769-7
Online ISBN: 978-3-319-38771-0
eBook Packages: Engineering, Engineering (R0)
