De-Frag: an efficient scheme to improve deduplication performance via reducing data placement de-linearization

Cluster Computing

Abstract

Data deduplication has become a commodity in large-scale storage systems, especially in data backup and archival systems. However, because it removes redundant data, deduplication de-linearizes data placement and forces the chunks of a single data object to be scattered across multiple separate units. In our preliminary study, we found that this de-linearization compromises the spatial locality that some deduplication approaches rely on to improve data read performance, deduplication throughput, and deduplication efficiency. As a result, it significantly degrades deduplication performance and makes those approaches less effective. In this paper, we first analyze the negative effect of data placement de-linearization on deduplication performance, and then propose an effective approach called De-Frag to reduce it. The key idea of De-Frag is to write selected redundant data to disk rather than remove it. De-Frag quantifies the spatial locality of each chunk group by a spatial locality level (SPL for short) and writes the redundant chunks to disk when the SPL value is smaller than a preset value, thereby reducing the de-linearization of data placement and enhancing spatial locality. Experimental results driven by real-world datasets show that De-Frag effectively enhances data spatial locality and improves deduplication throughput, deduplication efficiency, and data read performance, at the cost of slightly lower compression ratios.
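
The threshold rule described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formula for SPL, so the definition used here is an assumption made purely for illustration: the fraction of a chunk group's fingerprints whose stored copies sit in one container. The names `spl`, `chunks_to_write`, `index`, and `threshold` are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Dict, List


def spl(group: List[str], index: Dict[str, int]) -> float:
    """Assumed SPL (not the paper's formula): the share of the group's
    chunks that are already stored together in a single container."""
    # Count, per container, how many chunks of this group it already holds.
    hits = Counter(index[fp] for fp in group if fp in index)
    if not group or not hits:
        return 0.0
    return hits.most_common(1)[0][1] / len(group)


def chunks_to_write(group: List[str], index: Dict[str, int],
                    threshold: float) -> List[str]:
    """Return the fingerprints that would be written to disk for this group."""
    if spl(group, index) < threshold:
        # Low locality: write redundant chunks too, so the group stays contiguous.
        return list(group)
    # High locality: ordinary deduplication, write only the new chunks.
    return [fp for fp in group if fp not in index]


# Hypothetical fingerprint index: fingerprint -> container id.
index = {"a": 1, "b": 1, "c": 1, "d": 7}
print(chunks_to_write(["a", "b", "c", "x"], index, threshold=0.5))
print(chunks_to_write(["a", "d", "x", "y"], index, threshold=0.5))
```

In this sketch the first group has three of its four chunks in container 1, so only the new chunk is written. The second group's duplicates are spread over two containers, its assumed SPL falls below the threshold, and all four chunks are written, trading some compression for locality.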

References

  1. Zhu, B., Li, K., Patterson, H.: Avoiding the disk bottleneck in the Data Domain deduplication file system, in FAST’08, Feb. 2008

  2. Lillibridge, M., Eshghi, K., Bhagwat, D., Deolalikar, V., Trezise, G., Campbell, P.: Sparse Indexing: Large scale, inline deduplication using sampling and locality, in FAST’09, Feb. 2009

  3. Bhagwat, D., Eshghi, K., Long, D.D., Lillibridge, M.: Extreme binning: scalable, parallel deduplication for chunk-based file backup, HP Laboratories, Tech. Rep. HPL-2009-10R2, Sep. 2009.

  4. Srinivasan, K., Bisson, T., Goodson, G., Voruganti, K.: iDedup: latency-aware, inline data deduplication for primary storage, in FAST’12, Feb. 2012.

  5. Nam, Y.J., Park, D., Du, D.: Assuring demanded read performance of data deduplication storage with backup datasets, in MASCOTS’12, Aug. 2012.

  6. Kaczmarczyk, M., Barczynski, M., Kilian, W., Dubnicki, C.: Reducing impact of data fragmentation caused by in-line deduplication, in SYSTOR’12, Jun. 2012.

  7. Li, X., Lillibridge, M., Uysal, M.: Reliability analysis of deduplicated and erasure-coded storage. ACM SIGMETRICS Perform Eval Rev 38(3), 4–9 (2011)

  8. Liu, C., Gu, Y., Sun, L., Yan, B., Wang, D.: R-ADMAD: high reliability provision for large-scale de-duplication archival storage systems, in ICS’09, Jun. 2009.

  9. Bhagwat, D., Pollack, K., Long, D.D.E., Schwarz, T., Miller, E.L., Pâris, J.-F.: Providing high reliability in a minimum redundancy archival storage system, in MASCOTS’06, Sep. 2006.

  10. Xia, W., Jiang, H., Feng, D., Hua, Y.: SiLo: a similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput, in USENIX’11, Jun. 2011.

  11. Rabin, M.O.: Fingerprinting by random polynomials, Center for Research in Computing Technology, Technical Report, Harvard University, TR-15-81, 1981.

  12. NIST, “Secure Hash Standard”, in FIPS PUB 180–1, May 1993.

  13. Dong, W., Douglis, F., Li, K., Patterson, H.: Tradeoffs in scalable data routing for deduplication clusters, in FAST’11, Feb. 2011.

  14. Tan, Y., Jiang, H., Feng, D., Tian, L., Yan, Z., Zhou, G.: SAM: A semantic-aware multi-tiered source de-duplication framework for cloud backup, in ICPP’10, Sep. 2010.

  15. Clements, A.T., Ahmad, I., Vilayannur, M., Li, J.: Decentralized deduplication in SAN cluster file systems, in USENIX’09, Jan. 2009.

  16. Dubnicki, C., Gryz, L., Heldt, L., Kaczmarczyk, M., Kilian, W., Strzelczak, P., Szczepkowski, J., Ungureanu, C., Welnicki, M.: Hydrastor: a scalable secondary storage, in FAST’09, Feb. 2009.

  17. You, L.L., Pollack, K.T., Long, D.D.E.: Deep Store: An archival storage system architecture, in ICDE’05, Apr. 2005.

  18. Vrable, M., Savage, S., Voelker, G.M.: Cumulus: Filesystem backup to the cloud, in FAST’09, Feb. 2009.

  19. Tan, Y., Jiang, H., Feng, D., Tian, L., Yan, Z.: CABdedupe: A Causality-based deduplication performance booster for cloud backup services, in IPDPS’11, May 2011.

  20. Adya, A., Bolosky, W.J., Castro, M., Cermak, G., Chaiken, R., Douceur, J.R., Howell, J., Lorch, J.R., Theimer, M., Wattenhofer, R. P.: FARSITE: federated, available, and reliable storage for an incompletely trusted environment, in OSDI’02, Dec. 2002.

  21. Bolosky, W.J., Corbin, S., Goebel, D., Douceur, J.R.: Single instance storage in Windows 2000, in USENIX’00, Aug. 2000.

  22. EMC Corporation: EMC Centera: Content Addressed Storage System, 2003.

  23. Quinlan, S., Dorward, S.: Venti: A new approach to archival storage, in FAST’02, Jan. 2002.

  24. Muthitacharoen, A., Chen, B., Mazières, D.: A low-bandwidth network file system, in SOSP’01, Oct. 2001.

  25. Bobbarjung, D.R., Jagannathan, S., Dubnicki, C.: Improving duplicate elimination in storage systems, ACM Trans Storage, 2(4), 2006.

  26. Eshghi, K.: A framework for analyzing and improving content based chunking algorithms, Hewlett Packard Laboratories, Tech. Rep. HPL-2005-30, Feb. 2005.

  27. Liu, C., Gu, Y., Sun, L., Yan, B., Wang, D.: ADMAD: Application-driven metadata aware de-duplication archival storage systems, in the 25th IEEE Conference on Mass Storage Systems and Technologies, Sep. 2008.

  28. Rhea, S., Cox, R., Pesterev, A.: Fast, inexpensive content-addressed storage in Foundation, in USENIX’08, Jun. 2008.

  29. Debnath, B., Sengupta, S., Li, J.: ChunkStash: speeding up inline storage deduplication using flash memory, in USENIX’10, Jun. 2010.

  30. Guo, F., Efstathopoulos, P.: Building a high-performance deduplication system, in USENIX’11, Jun. 2011.

  31. Tan, Y., Yan, Z., Feng, D., Sha, E.H.M.: Reducing the de-linearization of data placement to improve deduplication performance, in International Workshop on Data-Intensive Scalable Computing Systems (DISCS, in conjunction with the 2012 ACM/IEEE Supercomputing Conference), Nov. 2012.

Acknowledgments

A preliminary version of this work was presented at the International Workshop on Data-Intensive Scalable Computing Systems (DISCS, in conjunction with the 2012 ACM/IEEE Supercomputing Conference) [31]; this manuscript contains substantial changes. This work is supported by the Central Universities Fundamental Research Foundation of China under Grant Nos. 106112013CDJZR180009 and CDJZR14185501, the Chongqing Basic and Frontier Research Project of China under Grant Nos. cstc2013jcyjA40016, cstc2012ggC40005 and cstc2013jcyjA40025, the Research Fund for the Doctoral Program of Higher Education of China under Grant Nos. 20130191120031 and 20130191120030, the National Natural Science Foundation of China under Grant No. 61309004, the National High Technology Research and Development (“863” Program) of China under Grant No. 2013AA013202, the National Basic Research 973 Program of China under Grant No. 2011CB302301, and NSFC No. 61025008. The work at VCU is partially supported by the U.S. National Science Foundation (NSF) under grants CCF-1102624 and CNS-1218960. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.

Author information

Corresponding author

Correspondence to Yujuan Tan.

About this article

Cite this article

Tan, Y., Yan, Z., Feng, D. et al. De-Frag: an efficient scheme to improve deduplication performance via reducing data placement de-linearization. Cluster Comput 18, 79–92 (2015). https://doi.org/10.1007/s10586-014-0397-5

  • DOI: https://doi.org/10.1007/s10586-014-0397-5
