A survey on novel classification of deduplication storage systems

Mohamed, Shawgi M. A.; Wang, Yongli

doi:10.1007/s10619-020-07301-2

Published: 16 June 2020

Volume 39, pages 201–230, (2021)

Distributed and Parallel Databases

Shawgi M. A. Mohamed¹ & Yongli Wang¹

Abstract

The explosive growth of information has created many challenges for both storage and retrieval. The massive increase in digital data demands more storage space, which in turn sharply raises the performance burden and cost of backup. Data deduplication is a technique that eliminates replicated data, reduces bandwidth consumption, and minimizes disk usage and cost. Because a wide range of research has already been reported in the literature, this paper reviews the concepts, categories, and storage approaches that use data deduplication. Apart from the well-known classification that uses Granularity, Side, Timing, and Implementation to classify deduplication approaches, a new classification principle based on storage location is adopted. This classification identifies and describes the diverse methods. Moreover, the deduplication systems are comprehensively described according to their storage location, covering Local, Centralized, and Clustered storage systems. Furthermore, the objectives, techniques, features, and drawbacks of the most advanced methods of each type are discussed in detail. Finally, the major challenges facing deduplication systems are identified and illustrated.
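
As an illustration of the general idea only, the following minimal Python sketch shows how deduplication can store each distinct piece of data once and rebuild the original from a list of fingerprints. It is not taken from the paper: the function names `deduplicate` and `restore`, the fixed chunk size, and the choice of SHA-1 as the fingerprint are all assumptions made for this example.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and keep each distinct chunk once.

    Hypothetical helper for illustration; chunk_size and SHA-1 are assumptions.
    """
    store = {}   # fingerprint -> chunk bytes (unique chunks only)
    recipe = []  # ordered fingerprints needed to rebuild the input
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha1(chunk).hexdigest()
        # Store the chunk only if this fingerprint has not been seen before.
        store.setdefault(fingerprint, chunk)
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original byte string from the stored unique chunks."""
    return b"".join(store[fp] for fp in recipe)


data = b"ABCDABCDABCDWXYZ"
store, recipe = deduplicate(data)
```

Here the input splits into four chunks, three of which are identical, so only two chunks need to be kept while the recipe still records all four positions. Real systems differ in where the index and chunks live, which is the storage-location dimension the abstract refers to.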

Acknowledgments

This research was supported by the Nanjing Municipal Government-Nanjing University of Science and Technology Joint Scholarship for International Student. We thank our colleagues from the School of Computer Science and Engineering who provided insight and expertise that greatly assisted the research, although they may not agree with all of the conclusions of this paper.

Author information

Authors and Affiliations

Nanjing University of Science and Technology, Nanjing, China
Shawgi M. A. Mohamed & Yongli Wang

Corresponding author

Correspondence to Shawgi M. A. Mohamed.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mohamed, S.M.A., Wang, Y. A survey on novel classification of deduplication storage systems. Distrib Parallel Databases 39, 201–230 (2021). https://doi.org/10.1007/s10619-020-07301-2

Issue Date: March 2021
DOI: https://doi.org/10.1007/s10619-020-07301-2
