Prefetch-aware fingerprint cache management for data deduplication systems

Li, Mei; Zhang, Hongjun; Wu, Yanjun; Zhao, Chen

doi:10.1007/s11704-017-7119-0

Research Article
Published: 09 June 2018

Frontiers of Computer Science, Volume 13, pages 500–515 (2019)

Mei Li¹,², Hongjun Zhang¹,², Yanjun Wu¹ & Chen Zhao¹

Abstract

Data deduplication has been widely used in large-scale storage systems, particularly backup systems. Data deduplication systems typically divide data streams into chunks and identify redundant chunks by comparing chunk fingerprints. Maintaining all fingerprints in memory is not cost-effective because fingerprint indexes are typically very large. Many data deduplication systems therefore maintain a fingerprint cache in memory and exploit fingerprint prefetching to accelerate the deduplication process. Although fingerprint prefetching can improve the performance of data deduplication systems by leveraging the locality of workloads, inaccurately prefetched fingerprints may pollute the cache by evicting useful fingerprints. We observed that most of the prefetched fingerprints in a wide variety of applications are never used or are used only once, which severely limits the performance of data deduplication systems. We introduce a prefetch-aware fingerprint cache management scheme for data deduplication systems (PreCache) to alleviate prefetch-related cache pollution. We propose three prefetch-aware fingerprint cache replacement policies (PreCache-UNU, PreCache-UOO, and PreCache-MIX) to handle different types of cache pollution. Additionally, we propose an adaptive policy selector to select suitable policies for prefetch requests. We implement PreCache on two representative data deduplication systems (Block Locality Caching and SiLo) and evaluate its performance using three real-world workloads (Kernel, MacOS, and Homes). The experimental results show that PreCache improves deduplication throughput by up to 32.22%: by mitigating prefetch-related fingerprint cache pollution, it reduces on-disk fingerprint index lookups and improves the deduplication ratio.
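
The pipeline the abstract describes (fingerprint each chunk, check an in-memory cache, fall back to the on-disk index, and prefetch neighbouring fingerprints on a hit) can be illustrated with a small Python sketch. This is not the authors' PreCache implementation and does not reproduce the PreCache-UNU, PreCache-UOO, or PreCache-MIX policies. It assumes a plain LRU cache, SHA-1 fingerprints, fixed-size "containers" as the prefetch unit, and an in-memory dict standing in for the on-disk index. The names `FingerprintCache`, `deduplicate`, and `container_size` are hypothetical. The cache records how many times each prefetched fingerprint was hit before eviction, which is the kind of measurement behind the "never used or used only once" observation.

```python
import hashlib
from collections import OrderedDict


def fingerprint(chunk: bytes) -> str:
    """Return a chunk fingerprint (SHA-1 is an assumption, not from the paper)."""
    return hashlib.sha1(chunk).hexdigest()


class FingerprintCache:
    """Plain LRU fingerprint cache that tracks hits on prefetched entries."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()        # fingerprint -> [prefetched, hits]
        self.evicted_prefetch_hits = []     # hit counts of evicted prefetched entries

    def lookup(self, fp: str) -> bool:
        entry = self.entries.get(fp)
        if entry is None:
            return False
        entry[1] += 1                       # count the hit
        self.entries.move_to_end(fp)        # mark as most recently used
        return True

    def insert(self, fp: str, prefetched: bool) -> None:
        if fp in self.entries:
            self.entries.move_to_end(fp)
            return
        while len(self.entries) >= self.capacity:
            # Evict the least recently used entry and remember how useful
            # it was if it had been brought in by a prefetch.
            _, (was_prefetched, hits) = self.entries.popitem(last=False)
            if was_prefetched:
                self.evicted_prefetch_hits.append(hits)
        self.entries[fp] = [prefetched, 0]


def deduplicate(chunks, cache, container_size=4):
    """Deduplicate a chunk stream; return (duplicates, index_lookups)."""
    disk_index = {}     # fingerprint -> container id (stand-in for the on-disk index)
    containers = [[]]   # container id -> fingerprints stored together
    duplicates = 0
    index_lookups = 0
    for chunk in chunks:
        fp = fingerprint(chunk)
        if cache.lookup(fp):                # cache hit: duplicate, no index access
            duplicates += 1
            continue
        index_lookups += 1                  # cache miss: consult the (simulated) index
        cid = disk_index.get(fp)
        if cid is not None:
            duplicates += 1
            # Prefetch every fingerprint stored alongside the one just found.
            for neighbour in containers[cid]:
                cache.insert(neighbour, prefetched=(neighbour != fp))
            continue
        # Unique chunk: record it in the current container and in the index.
        if len(containers[-1]) >= container_size:
            containers.append([])
        containers[-1].append(fp)
        disk_index[fp] = len(containers) - 1
        cache.insert(fp, prefetched=False)
    return duplicates, index_lookups
```

Running `deduplicate` over a stream that repeats the same chunks twice shows the intended effect of prefetching: in the second pass, one index lookup per container brings in the neighbouring fingerprints, so the remaining chunks of that container are served from the cache. Inspecting `cache.evicted_prefetch_hits` afterwards gives a rough view of how many prefetched fingerprints were evicted with zero hits or a single hit under this simplified LRU model.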

Acknowledgements

This work was supported by the next generation of information technology strategic research program of the Chinese Academy of Sciences (XDA06010600).

Author information

Authors and Affiliations

Institute of Software, Chinese Academy of Sciences, Beijing, 100190, China
Mei Li, Hongjun Zhang, Yanjun Wu & Chen Zhao
University of Chinese Academy of Sciences, Beijing, 100049, China
Mei Li & Hongjun Zhang

Corresponding author

Correspondence to Mei Li.

Additional information

Mei Li is a PhD candidate in computer software and theory at the Institute of Software, Chinese Academy of Sciences, China. She received her BS degree in computer science and technology from Beijing University of Posts and Telecommunications, China, in 2011. Her research interests include operating systems and cloud computing.

Hongjun Zhang is a PhD candidate in computer software and theory at the Institute of Software, Chinese Academy of Sciences, China. He received his BS degree in software engineering from Shandong University, China, in 2012. His research interests include operating systems and distributed systems.

Yanjun Wu received his PhD degree in computer software and theory from the Institute of Software, Chinese Academy of Sciences (CAS), China, in 2006. Currently, he is a research professor at the Institute of Software, CAS. His research interests include operating systems and system security.

Chen Zhao received his PhD degree in computer software and theory from the Institute of Software, Chinese Academy of Sciences (CAS), China, in 2000. Currently, he is a research professor at the Institute of Software, CAS. His research interests include compiling, auto-testing, operating systems, and networking software.

Electronic supplementary material

Supplementary material, approximately 360 KB.

About this article

Cite this article

Li, M., Zhang, H., Wu, Y. et al. Prefetch-aware fingerprint cache management for data deduplication systems. Front. Comput. Sci. 13, 500–515 (2019). https://doi.org/10.1007/s11704-017-7119-0

Received: 10 April 2017
Accepted: 04 July 2017
Published: 09 June 2018
Issue Date: June 2019
DOI: https://doi.org/10.1007/s11704-017-7119-0
