Abstract
Due to the relatively low bandwidth of the WANs that support cloud backup services and the growing volume of backed-up data stored at service providers, the deduplication scheme used in a cloud backup environment must remove redundant data both to shorten backup times and storage costs during backup operations and to shorten restore times during restore operations. In this paper, we propose SAFE, a source deduplication framework for efficient cloud backup and restore operations. SAFE has three salient features: (1) Hybrid Deduplication, which combines global file-level and local chunk-level deduplication to strike an optimal tradeoff between deduplication efficiency and overhead, yielding short backup times; (2) Semantic-aware Elimination, which exploits file semantics to narrow the search space for redundant data during the hybrid deduplication process, reducing deduplication overhead; and (3) Unmodified Data Removal, which excludes files and data chunks that are kept intact from data transmission for some restore operations. Through extensive experiments driven by real-world datasets, the SAFE framework is shown to maintain a much higher deduplication efficiency/overhead ratio than existing solutions, shortening backup times by an average of 38.7% and reducing restore times by a ratio of up to 9.7:1.
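The two-tier idea behind Hybrid Deduplication can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes simple set-based indexes, SHA-1 fingerprints, and fixed-size chunking, whereas a real source deduplicator would typically use content-defined chunking and persistent indexes. All function and variable names here are hypothetical.

```python
import hashlib

def sha1(data: bytes) -> str:
    """Fingerprint a byte string with SHA-1 (a common dedup choice)."""
    return hashlib.sha1(data).hexdigest()

def hybrid_dedupe(files, global_file_index, local_chunk_index, chunk_size=4096):
    """Illustrative two-tier source deduplication:
    1. Global file-level pass: skip any file whose whole-file hash is
       already known in the (shared) backup store.
    2. Local chunk-level pass: for new files, send only chunks not yet
       present in this client's local chunk index.
    Returns the (file_name, chunk) pairs that would actually be sent.
    """
    to_send = []
    for name, data in files.items():
        fhash = sha1(data)
        if fhash in global_file_index:   # whole file is redundant: skip
            continue
        global_file_index.add(fhash)
        # Fixed-size chunking for brevity; content-defined chunking
        # (e.g. Rabin fingerprints) is the usual choice in practice.
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            if sha1(chunk) not in local_chunk_index:
                local_chunk_index.add(sha1(chunk))
                to_send.append((name, chunk))
    return to_send
```

For example, two identical 8 KB files produce a single transmitted chunk: the second file is eliminated entirely at the file level, and the first file's two identical 4 KB chunks collapse at the chunk level.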
Acknowledgments
This work is supported by the Fundamental Research Funds for the Central Universities under Grant No. 0903005203206 and No. CDJZR12180006, the National High Technology Research and Development (863) Program of China under Grant No. 2013AA013202 and No. 2013AA013203, Chongqing High-Tech Research Program csct2012ggC40005, the National Basic Research (973) Program of China under Grant No. 2011CB302301, NSFC No. 61025008, No. 61232004 and No. 61173014, and the US NSF under Grants IIS-0916859, CCF-0937993, CNS-1016609, CNS-1116606 and CNS-1015802.
Cite this article
Tan, Y., Jiang, H., Sha, EM. et al. SAFE: A Source Deduplication Framework for Efficient Cloud Backup Services. J Sign Process Syst 72, 209–228 (2013). https://doi.org/10.1007/s11265-013-0775-x