
A survey on novel classification of deduplication storage systems

Distributed and Parallel Databases

Abstract

The explosive growth of information has created many problems for both storage and retrieval. The massive increase in digital data demands more storage space, which in turn sharply raises the cost of backup and affects its performance. Data deduplication is a technique that eliminates replicated data, reduces bandwidth consumption, and minimizes disk usage and cost. Since deduplication has been studied extensively in the literature, this paper reviews the concepts, categories, and storage approaches that use data deduplication. In addition to the well-known classification of deduplication approaches by Granularity, Side, Timing, and Implementation, a new classification principle based on storage location is adopted. This classification identifies and describes the diverse methods. Moreover, the deduplication systems are comprehensively described according to storage location, covering Local, Centralized, and Clustered storage systems. Furthermore, the objectives, techniques, features, and drawbacks of the most advanced methods of each type are discussed. Finally, the major challenges facing deduplication systems are identified and illustrated.
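As a rough illustration of how deduplication eliminates replicated data, the following minimal sketch stores each distinct chunk only once and keeps an ordered list of fingerprints for reconstruction. It is not taken from the paper: the function names `deduplicate` and `restore`, the fixed 4096-byte chunk size, and the use of SHA-1 as the fingerprint are all assumptions chosen for illustration.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep each distinct chunk once.

    Hypothetical helper for illustration only; chunk size and hash
    function are assumptions, not values from the surveyed systems.
    """
    store = {}    # fingerprint -> chunk bytes (unique chunks only)
    recipe = []   # ordered fingerprints needed to rebuild the input
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha1(chunk).hexdigest()
        # Store the chunk only if this fingerprint has not been seen before.
        store.setdefault(fingerprint, chunk)
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original byte stream from the chunk store and recipe."""
    return b"".join(store[fp] for fp in recipe)
```

For an input made of three identical chunks followed by one different chunk, the store would hold two chunks while the recipe lists four fingerprints, which is where the space saving comes from.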



Acknowledgments

This research was supported by the Nanjing Municipal Government-Nanjing University of Science and Technology Joint Scholarship for International Student. We thank our colleagues from the School of Computer Science and Engineering who provided insight and expertise that greatly assisted the research, although they may not agree with all of the conclusions of this paper.

Author information

Corresponding author

Correspondence to Shawgi M. A. Mohamed.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Mohamed, S.M.A., Wang, Y. A survey on novel classification of deduplication storage systems. Distrib Parallel Databases 39, 201–230 (2021). https://doi.org/10.1007/s10619-020-07301-2


  • DOI: https://doi.org/10.1007/s10619-020-07301-2
