
SAFE: A Source Deduplication Framework for Efficient Cloud Backup Services

Journal of Signal Processing Systems

Abstract

Because cloud backup services run over relatively low-bandwidth WAN links and the volume of backed-up data stored at service providers keeps growing, a deduplication scheme for the cloud backup environment must remove redundant data both from backup operations, to reduce backup times and storage costs, and from restore operations, to reduce restore times. In this paper, we propose SAFE, a source deduplication framework for efficient cloud backup and restore operations. SAFE has three salient features: (1) Hybrid Deduplication, which combines global file-level and local chunk-level deduplication to achieve an optimal tradeoff between deduplication efficiency and overhead and thus a short backup time; (2) Semantic-aware Elimination, which exploits file semantics to narrow the search space for redundant data in the hybrid deduplication process and thereby reduces the deduplication overhead; and (3) Unmodified Data Removal, which removes files and data chunks that have been kept intact from data transmission for some restore operations. Through extensive experiments driven by real-world datasets, the SAFE framework is shown to maintain a much higher deduplication efficiency/overhead ratio than existing solutions, shortening the backup time by an average of 38.7% and reducing the restore time by a ratio of up to 9.7:1.
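The hybrid deduplication idea described in the abstract can be illustrated with a minimal Python sketch. It is not the SAFE implementation: the class name `HybridDedupClient`, the tiny fixed-size chunks, the use of SHA-1 fingerprints, and the choice of the file extension as the "file semantics" that partitions the local chunk index are all assumptions made for illustration. The sketch first checks a whole-file fingerprint against a global index, and only on a miss falls back to chunk-level checks against a local, per-file-type index.

```python
import hashlib
from collections import defaultdict

CHUNK_SIZE = 4  # assumption: tiny fixed-size chunks, for illustration only


def fingerprint(data: bytes) -> str:
    """Return a SHA-1 hex digest used as a file or chunk fingerprint."""
    return hashlib.sha1(data).hexdigest()


class HybridDedupClient:
    """Hypothetical client: global file-level plus local chunk-level dedup."""

    def __init__(self, global_file_index: set):
        # File fingerprints known across all clients (global, file level).
        self.global_file_index = global_file_index
        # Chunk fingerprints seen by this client (local, chunk level),
        # partitioned by file type so each lookup searches a smaller index.
        self.local_chunk_index = defaultdict(set)

    def backup(self, name: str, data: bytes) -> list:
        """Return the chunks that still have to be transmitted."""
        file_fp = fingerprint(data)
        if file_fp in self.global_file_index:
            return []  # the whole file is a duplicate, send nothing
        self.global_file_index.add(file_fp)

        # Assumption: the file extension stands in for file semantics.
        file_type = name.rsplit(".", 1)[-1] if "." in name else ""
        index = self.local_chunk_index[file_type]

        to_send = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            chunk_fp = fingerprint(chunk)
            if chunk_fp not in index:
                index.add(chunk_fp)
                to_send.append(chunk)
        return to_send


client = HybridDedupClient(global_file_index=set())
first = client.backup("a.txt", b"aaaabbbbcccc")   # new file, all chunks new
second = client.backup("b.txt", b"aaaabbbbdddd")  # new file, one new chunk
third = client.backup("c.txt", b"aaaabbbbcccc")   # duplicate file
```

In this toy run, the second backup only needs to send the one chunk not seen before, and the third sends nothing because the file-level check already matches. Restricting chunk lookups to the index for the same file type is one simple way to narrow the search space; the paper's actual semantic criteria may differ.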



Acknowledgments

This work is supported by the Fundamental Research Funds for the Central Universities under Grant No. 0903005203206 and No. CDJZR12180006, the National High Technology Research and Development (863 Program) of China under Grant No. 2013AA013202 and No. 2013AA013203, Chongqing High-Tech Research Program csct2012ggC40005, National Basic Research 973 Program of China under Grant No. 2011CB302301, NSFC No. 61025008, No. 61232004 and No. 61173014, and the US NSF under grants IIS-0916859, CCF-0937993, CNS-1016609, CNS-1116606 and CNS-1015802.


Corresponding author

Correspondence to Yujuan Tan.


Cite this article

Tan, Y., Jiang, H., Sha, EM. et al. SAFE: A Source Deduplication Framework for Efficient Cloud Backup Services. J Sign Process Syst 72, 209–228 (2013). https://doi.org/10.1007/s11265-013-0775-x
