Improving Metadata Caching Efficiency for Data Deduplication via In-RAM Metadata Utilization

Zhou, Bing; Wen, Jiang-Tao

doi:10.1007/s11390-016-1664-0

Improving Metadata Caching Efficiency for Data Deduplication via In-RAM Metadata Utilization

Regular Paper
Published: 08 July 2016

Volume 31, pages 805–819, (2016)
Cite this article

Journal of Computer Science and Technology Aims and scope Submit manuscript

Bing Zhou^1,2,3 &
Jiang-Tao Wen^1,2,3

91 Accesses
3 Citations
Explore all metrics

Abstract

We describe a data deduplication system for backup storage of PC disk images, named in-RAM metadata utilizing deduplication (IR-MUD). In-RAM hash granularity adaptation and miniLZO based data compression are firstly proposed to reduce the in-RAM metadata size and thereby reduce the space overheads required by the in-RAM metadata caches. Secondly, an in-RAM metadata write cache, as opposed to the traditional metadata read cache, is proposed for further reducing metadata-related disk I/O operations and improving deduplication throughput. During deduplication, the metadata write cache is managed following the LRU caching policy. For each manifest that is hit in the metadata write cache, an expensive manifest reloading operation from the disk is avoided. After deduplication, all the manifests in the metadata write cache are cleared and stored on the disk. Our experimental results using 1.5 TB real-world disk image dataset show that 1) IR-MUD achieved about 95% size reduction for the deduplication metadata, with a small time overhead introduced, 2) when the metadata write cache was not utilized, with the same RAM space size for the metadata read cache, IR-MUD achieved a 400% higher RAM hit ratio and a 50% higher deduplication throughput, as compared with the classic Sparse Indexing deduplication system where no metadata utilization approaches are utilized, and 3) when the metadata write cache was utilized and enough RAM space was available, IR-MUD achieved a 500% higher RAM hit ratio compared with Sparse Indexing and a 70% higher deduplication throughput compared with IR-MUD with only a single metadata read cache. The in-RAM metadata harnessing and metadata write caching approaches of IR-MUD can be applied in most parallel deduplication systems for improving metadata caching efficiency.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Data Deduplication Framework of Disk Images with Adaptive Block Skipping

Article 08 July 2016

Bing Zhou & Jiang-Tao Wen

DASM: A Dynamic Adaptive Forward Assembly Area Method to Accelerate Restore Speed for Deduplication-Based Backup Systems

De-Frag: an efficient scheme to improve deduplication performance via reducing data placement de-linearization

Article 22 August 2014

Yujuan Tan, Zhichao Yan, … Lei Yang

References

Black J. Compare-by-hash: A reasoned analysis. In Proc. the USENIX Annual Technical Conference (ATC), May 2006, pp.85-90.
Meister D, Kaiser J, Brinkmann A, Cortes T, Kuhn M, Kunkel J. A study on data deduplication in HPC storage systems. In Proc. the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2012, Article No. 7.
Bloom B H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, July 1970, 13(7): 422-426.
Lillibridge M, Eshghi K, Bhagwat D, Deolalikar V, Trezise G, Camble P. Sparse Indexing: Large scale, inline deduplication using sampling and locality. In Proc. the 7th USENIX Conference on File and Storage Technologies (FAST), February 2009, pp.111-123.
Tanenbaum A S. Modern Operating Systems (2nd edition). Prentice Hall PTR, 2001.
Zhou B, Wen J. Hysteresis re-chunking based metadata harnessing deduplication of disk images. In Proc. the 42nd IEEE International Conference on Parallel Processing (ICPP), October 2013, pp.389-398.
Rabin M O. Fingerprinting by random polynomials. Technical Report, TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
Muthitacharoen A, Chen B, Mazi`eres D. A low-bandwidth network file system. In Proc. the 18th ACM Symposium on Operating Systems Principles, October 2001, pp.174-187.
Roma´nski B, Heldt T, Kilian W et al. Anchor-driven subchunk deduplication. In Proc. the 4th Annual International Conference on Systems and Storage (SYSTOR), May 2011, pp.16:1-6:13.
Tolia N, Kozuch M, Satyanarayanan M, Karp B, Bressoud T, Perrig A. Opportunistic use of content addressable storage for distributed file systems. In Proc. the USENIX Annual Technical Conference (ATC), June 2003, pp.127-140.
Bhagwat D, Eshghi K, Long D D E, Lillibridge M. Extreme Binning: Scalable, parallel deduplication for chunk-based file backup. In Proc. the IEEE International Symposium on Modeling, Analysis Simulation of Computer and Telecommunication Systems (MASCOTS), September 2009.
Shilane P, Huang M, Wallace G, Hsu W. WAN-optimized replication of backup datasets using stream-informed delta compression. Transactions on Storage, 2012, 8(4): 13:1-13:26.
Tuduce I C, Gross T. Adaptive main memory compression. In Proc. the USENIX Annual Technical Conference (ATEC), April 2005, pp.237-250.
Wilson P R, Kaplan S F, Smaragdakis Y. The case for compressed caching in virtual memory systems. In Proc. the USENIX Annual Technical Conference (ATEC), June 1999, pp.101-116.
Gupta D, Lee S, Vrable M, Savage S, Snoeren A C, Varghese G, Voelker G M, Vahdat A. Difference engine: Harnessing memory redundancy in virtual machines. Commun. ACM, 2010, 53(10): 85-93.
Article Google Scholar
Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 1977, 23(3): 337-343.
Article MathSciNet MATH Google Scholar
Jin K, Miller E L. The effectiveness of deduplication on virtual machine disk images. In Proc. the 2nd Annual International Systems and Storage Conference: The Israeli Experimental Systems Conference (SYSTOR), May 2009, pp.7:1-7:12.
Zhu B, Li K, Patterson H. Avoiding the disk bottleneck in the data domain deduplication file system. In Proc. the 6th USENIX Conference on File and Storage Technologies (FAST), February 2008, Article No. 18.
Wildani A, Miller E, Rodeh O. HANDS: A heuristically arranged non-backup inline deduplication system. In Proc. the 29th IEEE International Conference on Data Engineering, April 2013, pp.446-457.
Botelho F C, Shilane P, Gang N, Hsu W. Memory efficient sanitization of a deduplicated storage system. In Proc. the 11th USENIX Conference on File and Storage Technologies (FAST), February 2013, pp.81-84.
Alameldeen A R, Wood D A. Adaptive cache compression for high-performance processors. In Proc. the 31st Annual International Symposium on Computer Architecture (ISCA), June 2004, pp.212-223.
Debnath B, Sengupta S, Li J. ChunkStash: Speeding up inline storage deduplication using flash memory. In Proc. the USENIX Annual Technical Conference (ATC), June 2010, Article No. 16.
Meister D, Brinkmann A. dedupv1: Improving deduplication throughput using solid state drives (SSD). In Proc. the 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST), May 2010.
Chen F, Luo T, Zhang X. CAFTL: A content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In Proc. the 9th USENIX Conference on File and Stroage Technologies (FAST), February 2011, Article No. 6.
Agrawal N, Prabhakaran V, Wobber T, Davis J D, Manasse M, Panigrahy R. Design tradeoffs for SSD performance. In Proc. the USENIX Annual Technical Conference (ATC), June 2008, pp.57-70.
Kaiser J, Brinkmann A, Süβ T, Meister D. Deriving and comparing deduplication techniques using a model-based classification. In Proc. the 10th European Conference on Computer Systems (EuroSys), April 2015, pp.11:1-11:13.
Srinivasan K, Bisson T, Goodson G, Voruganti K. iDedup: Latency-aware, inline data deduplication for primary storage. In Proc. the 10th USENIX Conference on File and Storage Technologies (FAST), February 2012, pp.299-312.
Bobbarjung D R, Jagannathan S, Dubnicki C. Improving duplicate elimination in storage systems. ACM Transactions on Storage, 2006, 2(4): 424-448.
Article Google Scholar
Kruus E, Ungureanu C, Dubnicki C. Bimodal content defined chunking for backup streams. In Proc. the 8th USENIX Conference on File and Storage Technologies (FAST), February 2010, Article No. 18.
Lu G, Jin Y, Du D. Frequency based chunking for data deduplication. In Proc. IEEE International Symposium on Modeling, Analysis Simulation of Computer and Telecommunication Systems (MASCOTS), August 2010, pp.287-296.

Download references

Author information

Authors and Affiliations

State Key Laboratory on Intelligent Technology and Systems, Tsinghua University, Beijing, 100084, China
Bing Zhou & Jiang-Tao Wen
Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing, 100084, China
Bing Zhou & Jiang-Tao Wen
Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Bing Zhou & Jiang-Tao Wen

Authors

Bing Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Jiang-Tao Wen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Bing Zhou.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhou, B., Wen, JT. Improving Metadata Caching Efficiency for Data Deduplication via In-RAM Metadata Utilization. J. Comput. Sci. Technol. 31, 805–819 (2016). https://doi.org/10.1007/s11390-016-1664-0

Download citation

Received: 08 July 2015
Revised: 10 January 2016
Published: 08 July 2016
Issue Date: July 2016
DOI: https://doi.org/10.1007/s11390-016-1664-0

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Improving Metadata Caching Efficiency for Data Deduplication via In-RAM Metadata Utilization

Abstract

Access this article

Similar content being viewed by others

A Data Deduplication Framework of Disk Images with Adaptive Block Skipping

DASM: A Dynamic Adaptive Forward Assembly Area Method to Accelerate Restore Speed for Deduplication-Based Backup Systems

De-Frag: an efficient scheme to improve deduplication performance via reducing data placement de-linearization

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Improving Metadata Caching Efficiency for Data Deduplication via In-RAM Metadata Utilization

Abstract

Access this article

Similar content being viewed by others

A Data Deduplication Framework of Disk Images with Adaptive Block Skipping

DASM: A Dynamic Adaptive Forward Assembly Area Method to Accelerate Restore Speed for Deduplication-Based Backup Systems

De-Frag: an efficient scheme to improve deduplication performance via reducing data placement de-linearization

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation