De-Frag: an efficient scheme to improve deduplication performance via reducing data placement de-linearization

Tan, Yujuan; Yan, Zhichao; Feng, Dan; He, Xubin; Zou, Qiang; Yang, Lei

doi:10.1007/s10586-014-0397-5

Published: 22 August 2014

Volume 18, pages 79–92, (2015)

Cluster Computing

Abstract

Data deduplication has become a commodity in large-scale storage systems, especially in data backup and archival systems. However, because it removes redundant data, deduplication de-linearizes data placement and forces the chunks of the same data object to be split across multiple separate units. In our preliminary study, we found that this de-linearization compromises the spatial locality that some deduplication approaches rely on to improve data read performance, deduplication throughput, and deduplication efficiency. As a result, it significantly degrades deduplication performance and makes those approaches less effective. In this paper, we first analyze the negative effect of data placement de-linearization on deduplication performance, and then propose an effective approach called De-Frag to reduce it. The key idea of De-Frag is to write some redundant data to disk rather than remove it. De-Frag quantifies the spatial locality of each chunk group by a spatial locality level (SPL for short) and writes the redundant chunks to disk when the SPL value is smaller than a preset value, thereby reducing the de-linearization of data placement and enhancing spatial locality. Experimental results driven by real-world datasets show that De-Frag effectively enhances data spatial locality and improves deduplication throughput, deduplication efficiency, and data read performance, at the cost of slightly lower compression ratios.
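
The threshold rule described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the SPL formula, so the definition used here is an assumption: SPL is taken to be the largest fraction of a chunk group's fingerprints that a single stored container already holds. The names `spatial_locality_level`, `plan_writes`, `index`, and `threshold` are hypothetical and are not taken from the paper.

```python
from collections import Counter

def spatial_locality_level(group, index):
    """Assumed SPL: largest share of the group's chunks found in one container.

    group: list of chunk fingerprints in the incoming chunk group.
    index: dict mapping a stored fingerprint to its container id.
    This formula is an illustrative assumption, not the paper's definition.
    """
    counts = Counter(index[fp] for fp in group if fp in index)
    if not group or not counts:
        return 0.0
    return max(counts.values()) / len(group)

def plan_writes(group, index, threshold):
    """Return the fingerprints that would be written to disk.

    Unique chunks are always written. Redundant chunks are written as well
    when the group's SPL falls below the preset threshold, which keeps the
    group contiguous at the cost of storing some duplicates.
    """
    rewrite = spatial_locality_level(group, index) < threshold
    return [fp for fp in group if fp not in index or rewrite]

# Hypothetical example: three of four chunks are duplicates, spread over
# two containers, so the assumed SPL is 2/4 = 0.5.
index = {"a": 1, "b": 1, "c": 2}
group = ["a", "b", "c", "d"]
print(spatial_locality_level(group, index))
print(plan_writes(group, index, threshold=0.6))  # SPL below threshold
print(plan_writes(group, index, threshold=0.5))  # SPL not below threshold
```

With the higher threshold the whole group is written sequentially, duplicates included. With the lower one only the new chunk `d` is written and the duplicates are removed as in ordinary deduplication. This is the trade-off the abstract names: better locality against a slightly lower compression ratio.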

References

Zhu, B., Li, K., Patterson, H.: Avoiding the disk bottleneck in the Data Domain deduplication file system, in FAST’08, Feb. 2008.
Lillibridge, M., Eshghi, K., Bhagwat, D., Deolalikar, V., Trezise, G., Campbell, P.: Sparse Indexing: Large scale, inline deduplication using sampling and locality, in FAST’09, Feb. 2009.
Bhagwat, D., Eshghi, K., Long, D.D., Lillibridge, M.: Extreme binning: scalable, parallel deduplication for chunk-based file backup, HP Laboratories, Tech. Rep. HPL-2009-10R2, Sep. 2009.
Srinivasan, K., Bisson, T., Goodson, G., Voruganti, K.: iDedup: latency-aware, inline data deduplication for primary storage, in FAST’12, Feb. 2012.
Nam, Y.J., Park, D., Du, D.: Assuring demanded read performance of data deduplication storage with backup datasets, in MASCOTS’12, Aug. 2012.
Kaczmarczyk, M., Barczynski, M., Kilian, W., Dubnicki, C.: Reducing impact of data fragmentation caused by in-line deduplication, in SYSTOR’12, Jun. 2012.
Li, X., Lillibridge, M., Uysal, M.: Reliability analysis of deduplicated and erasure-coded storage. ACM SIGMETRICS Perform Eval Rev 38(3), 4–9 (2011)
Liu, C., Gu, Y., Sun, L., Yan, B., Wang, D.: R-ADMAD: high reliability provision for large-scale de-duplication archival storage systems, in ICS’09, Jun. 2010.
Bhagwat, D., Pollack, K., Long, D.D.E., Schwarz, T., Miller, E.L., Pâris, J.-F.: Providing high reliability in a minimum redundancy archival storage system, in MASCOTS’06, Sep. 2006.
Xia, W., Jiang, H., Feng, D., Hua, Y.: SiLo: a similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput, in USENIX’11, Jun. 2011.
Rabin, M.O.: Fingerprinting by random polynomials, Center for Research in Computing Technology, Technical Report, Harvard University, TR-15-81, 1981.
NIST: Secure Hash Standard, FIPS PUB 180-1, May 1993.
Dong, W., Douglis, F., Li, K., Patterson, H.: Tradeoffs in scalable data routing for deduplication clusters, in FAST’11, Feb. 2011.
Tan, Y., Jiang, H., Feng, D., Tian, L., Yan, Z., Zhou, G.: SAM: A semantic-aware multi-tiered source de-duplication framework for cloud backup, in ICPP’10, Sep. 2010.
Clements, A.T., Ahmad, I., Vilayannur, M., Li, J.: Decentralized deduplication in SAN cluster file systems, in USENIX’09, Jan. 2009.
Dubnicki, C., Gryz, L., Heldt, L., Kaczmarczyk, M., Kilian, W., Strzelczak, P., Szczepkowski, J., Ungureanu, C., Welnicki, M.: Hydrastor: a scalable secondary storage, in FAST’09, Feb. 2009.
You, L.L., Pollack, K.T., Long, D.D.E.: Deep Store: An archival storage system architecture, in ICDE’05, Apr. 2005.
Vrable, M., Savage, S., Voelker, G.M.: Cumulus: Filesystem backup to the cloud, in FAST’09, Feb. 2009.
Tan, Y., Jiang, H., Feng, D., Tian, L., Yan, Z.: CABdedupe: A Causality-based deduplication performance booster for cloud backup services, in IPDPS’11, May 2011.
Adya, A., Bolosky, W.J., Castro, M., Cermak, G., Chaiken, R., Douceur, J.R., Howell, J., Lorch, J.R., Theimer, M., Wattenhofer, R. P.: FARSITE: federated, available, and reliable storage for an incompletely trusted environment, in OSDI’02, Dec. 2002.
Bolosky, W.J., Corbin, S., Goebel, D., Douceur, J.R.: Single instance storage in Windows 2000, in USENIX’00, Aug. 2000.
EMC Corporation: EMC Centera: Content Addressed Storage System, 2003.
Quinlan, S., Dorward, S.: Venti: A new approach to archival storage, in FAST’02, Jan. 2002.
Muthitacharoen, A., Chen, B., Mazières, D.: A low-bandwidth network file system, in SOSP’01, Oct. 2001.
Bobbarjung, D.R., Jagannathan, S., Dubnicki, C.: Improving duplicate elimination in storage systems, ACM Trans Storage, 2(4), 2006.
Eshghi, K.: A framework for analyzing and improving content based chunking algorithms, Hewlett Packard Laboratories, Tech. Rep. HPL-2005-30, Feb. 2005.
Liu, C., Gu, Y., Sun, L., Yan, B., Wang, D.: ADMAD: Application-driven metadata aware de-duplication archival storage systems, in the 25th IEEE Conference on Mass Storage Systems and Technologies, Sep. 2008.
Rhea, S., Cox, R., Pesterev, A.: Fast, inexpensive content-addressed storage in Foundation, in USENIX’08, Jun. 2008.
Debnath, B., Sengupta, S., Li, J.: ChunkStash: speeding up inline storage deduplication using flash memory, in USENIX’10, Jun. 2010.
Guo, F., Efstathopoulos, P.: Building a high-performance deduplication system, in USENIX’11, Jun. 2011.
Tan, Y., Yan, Z., Feng, D., Sha, E.H.M.: Reducing the de-linearization of data placement to improve deduplication performance, in International Workshop on Data-Intensive Scalable Computing Systems (DISCS, in conjunction with the 2012 ACM/IEEE Supercomputing Conference), Nov. 2012.

Acknowledgments

A preliminary version of this work was presented at the International Workshop on Data-Intensive Scalable Computing Systems (DISCS, in conjunction with the 2012 ACM/IEEE Supercomputing Conference) [31], and we have made substantial changes in this manuscript. This work is supported by the Central Universities Fundamental Research Foundation of China under Grant No. 106112013CDJZR180009 and CDJZR14185501, the Chongqing Basic and Frontier Research Project of China under Grant No. cstc2013jcyjA40016, cstc2012ggC40005 and cstc2013jcyjA40025, the Research Fund for the Doctoral Program of Higher Education of China under Grant No. 20130191120031 and 20130191120030, the National Natural Science Foundation of China under Grant No. 61309004, the National High Technology Research and Development (“863” Program) of China under Grant No. 2013AA013202, the National Basic Research 973 Program of China under Grant No. 2011CB302301, and NSFC No. 61025008. The work at VCU is partially supported by the U.S. National Science Foundation (NSF) under grants CCF-1102624 and CNS-1218960. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.

Author information

Authors and Affiliations

College of Computer Science, Chongqing University, Chongqing, China
Yujuan Tan & Lei Yang
Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, USA
Zhichao Yan
School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
Dan Feng
Department of Electrical and Computer Engineering, Virginia Commonwealth University, Richmond, VA, USA
Xubin He
School of Computer, Southwest University, Chongqing, China
Qiang Zou

Corresponding author

Correspondence to Yujuan Tan.

About this article

Cite this article

Tan, Y., Yan, Z., Feng, D. et al. De-Frag: an efficient scheme to improve deduplication performance via reducing data placement de-linearization. Cluster Comput 18, 79–92 (2015). https://doi.org/10.1007/s10586-014-0397-5

Received: 23 January 2013
Revised: 22 September 2013
Accepted: 24 July 2014
Published: 22 August 2014
Issue Date: March 2015
DOI: https://doi.org/10.1007/s10586-014-0397-5
