Abstract
Due to the relatively low bandwidth of the WANs that support cloud backup services and the growing volume of backed-up data stored at service providers, the deduplication scheme used in a cloud backup environment must remove redundant data both to shorten backup times and storage costs during backup operations and to shorten restore times during restore operations. In this paper, we propose SAFE, a source deduplication framework for efficient cloud backup and restore operations. SAFE has three salient features: (1) Hybrid Deduplication, which combines global file-level and local chunk-level deduplication to strike an optimal tradeoff between deduplication efficiency and overhead, yielding short backup times; (2) Semantic-aware Elimination, which exploits file semantics to narrow the search space for redundant data during the hybrid deduplication process, reducing deduplication overhead; and (3) Unmodified Data Removal, which excludes files and data chunks that are kept intact from data transmission for some restore operations. Through extensive experiments driven by real-world datasets, the SAFE framework is shown to maintain a much higher deduplication efficiency/overhead ratio than existing solutions, shortening backup times by an average of 38.7% and reducing restore times by a ratio of up to 9.7:1.
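The two-tier idea behind Hybrid Deduplication can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes simple set-based indexes, SHA-1 fingerprints, and fixed-size chunking, whereas a real source deduplicator would typically use content-defined chunking and persistent indexes. All function and variable names here are hypothetical.

```python
import hashlib

def sha1(data: bytes) -> str:
    """Fingerprint a byte string with SHA-1 (a common dedup choice)."""
    return hashlib.sha1(data).hexdigest()

def hybrid_dedupe(files, global_file_index, local_chunk_index, chunk_size=4096):
    """Illustrative two-tier source deduplication:
    1. Global file-level pass: skip any file whose whole-file hash is
       already known in the (shared) backup store.
    2. Local chunk-level pass: for new files, send only chunks not yet
       present in this client's local chunk index.
    Returns the (file_name, chunk) pairs that would actually be sent.
    """
    to_send = []
    for name, data in files.items():
        fhash = sha1(data)
        if fhash in global_file_index:   # whole file is redundant: skip
            continue
        global_file_index.add(fhash)
        # Fixed-size chunking for brevity; content-defined chunking
        # (e.g. Rabin fingerprints) is the usual choice in practice.
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            if sha1(chunk) not in local_chunk_index:
                local_chunk_index.add(sha1(chunk))
                to_send.append((name, chunk))
    return to_send
```

For example, two identical 8 KB files produce a single transmitted chunk: the second file is eliminated entirely at the file level, and the first file's two identical 4 KB chunks collapse at the chunk level.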
Acknowledgments
This work is supported by the Fundamental Research Funds for the Central Universities under Grant No. 0903005203206 and No. CDJZR12180006, the National High Technology Research and Development (863) Program of China under Grant No. 2013AA013202 and No. 2013AA013203, Chongqing High-Tech Research Program csct2012ggC40005, the National Basic Research (973) Program of China under Grant No. 2011CB302301, NSFC No. 61025008, No. 61232004 and No. 61173014, and the US NSF under Grants IIS-0916859, CCF-0937993, CNS-1016609, CNS-1116606 and CNS-1015802.
Cite this article
Tan, Y., Jiang, H., Sha, EM. et al. SAFE: A Source Deduplication Framework for Efficient Cloud Backup Services. J Sign Process Syst 72, 209–228 (2013). https://doi.org/10.1007/s11265-013-0775-x