Abstract
The explosive growth of digital data has created serious challenges for both storage and retrieval. Ever-larger data volumes demand more storage space, which in turn sharply increases the performance overhead and cost of backup. Data deduplication is a technique that eliminates redundant copies of data, thereby reducing bandwidth consumption, disk usage, and cost. Because deduplication has been studied extensively in the literature, this paper reviews its concepts, categories, and the storage approaches that employ it. In addition to the well-known classification of deduplication approaches by granularity, side, timing, and implementation, a new classification criterion based on storage location is adopted. This classification is used to identify and describe the diverse methods. Deduplication systems are described comprehensively according to storage location, covering local, centralized, and clustered storage systems. For the most advanced methods of each type, the objectives, techniques, features, and drawbacks are discussed. Finally, the major challenges facing deduplication systems are identified and illustrated.
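As a rough illustration of the core idea summarized above, and not a method taken from any of the surveyed systems, the following minimal sketch assumes fixed-size chunking with SHA-256 fingerprints. The function names `deduplicate`, `restore`, and `dedup_ratio` and the 4 KiB default chunk size are hypothetical choices made only for this example.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep each unique chunk once.

    Returns (store, recipe): store maps fingerprint -> chunk bytes, and
    recipe is the ordered list of fingerprints needed to rebuild the input.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        # The fingerprint identifies the chunk; equal chunks share one entry.
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original byte string from the chunk store and recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


def dedup_ratio(data: bytes, store) -> float:
    """Logical size divided by physically stored size (1.0 if nothing is stored)."""
    stored = sum(len(chunk) for chunk in store.values())
    return len(data) / stored if stored else 1.0


if __name__ == "__main__":
    # Three identical 4 KiB chunks followed by one different chunk.
    sample = b"A" * 4096 * 3 + b"B" * 4096
    chunk_store, chunk_recipe = deduplicate(sample)
    print(len(chunk_store), len(chunk_recipe), dedup_ratio(sample, chunk_store))
```

In this sketch the granularity is the chunk: using a whole file as the unit would only require passing a chunk size at least as large as the input. Content-defined chunking, as used by several systems in the literature, would replace the fixed offsets with boundaries derived from the data itself.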
Acknowledgments
This research was supported by the Nanjing Municipal Government-Nanjing University of Science and Technology Joint Scholarship for International Student. We thank our colleagues from the School of Computer Science and Engineering who provided insight and expertise that greatly assisted the research, although they may not agree with all of the conclusions of this paper.
Cite this article
Mohamed, S.M.A., Wang, Y. A survey on novel classification of deduplication storage systems. Distrib Parallel Databases 39, 201–230 (2021). https://doi.org/10.1007/s10619-020-07301-2