An Enhanced Physical-Locality Deduplication System for Space Efficiency

Li, Peng-Fei; Hua, Yu; Cao, Qin

doi:10.1007/s11390-023-2646-7

Regular Paper
Computer Architecture and Systems
Published: 16 January 2025

Volume 39, pages 1361–1379 (2024)
Journal of Computer Science and Technology

Peng-Fei Li (李鹏飞)¹,
Yu Hua (华宇)¹ &
Qin Cao (曹钦)¹

Abstract

An abundance of data has been generated by various embedded devices, applications, and systems, and requires cost-efficient storage services. Data deduplication removes duplicate chunks and has become an important technique for improving the space efficiency of storage systems. However, the stored unique chunks are heavily fragmented, which decreases restore performance and incurs high overheads for garbage collection. Existing schemes fail to achieve an efficient trade-off among deduplication, restore, and garbage collection performance because they do not explore and exploit the physical locality of different chunks. In this paper, we trace the storage patterns of the fragmented chunks in backup systems and propose a high-performance deduplication system called HiDeStore. The main insight is to enhance the physical locality of new backup versions during the deduplication phase by identifying hot chunks and storing them in active containers. Chunks that do not appear in new backups become cold and are gathered together in archival containers. Moreover, we remove expired data with an isolated container deletion scheme, which avoids the high overheads of expired data detection. Compared with state-of-the-art schemes, HiDeStore improves deduplication and restore performance by up to 1.4x and 1.6x, respectively, without decreasing the deduplication ratio or incurring high garbage collection overheads.
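The hot/cold separation described in the abstract can be illustrated with a toy model. The sketch below is not HiDeStore's implementation: the class name `ToyHotColdStore`, its methods, the use of SHA-1 fingerprints, and the rule that a chunk turns cold as soon as one backup version omits it are all assumptions made for illustration. It deduplicates each backup only against the active set, moves chunks absent from the newest version into an archival container tagged with the last version that used them, and expires old versions by dropping whole archival containers without scanning for unreferenced chunks.

```python
import hashlib


class ToyHotColdStore:
    """Hypothetical toy model of hot (active) and cold (archival) chunk placement."""

    def __init__(self):
        self.active = {}    # fingerprint -> chunk bytes (hot chunks)
        self.archival = {}  # last version using the chunks -> {fingerprint: bytes}
        self.version = 0

    def backup(self, chunks):
        """Deduplicate one backup version; return its recipe (fingerprint list)."""
        self.version += 1
        recipe, seen = [], set()
        for chunk in chunks:
            fp = hashlib.sha1(chunk).hexdigest()
            recipe.append(fp)
            seen.add(fp)
            # Assumption: duplicates are detected only against hot chunks.
            self.active.setdefault(fp, chunk)
        # Chunks the new version did not reference turn cold and are
        # gathered into one archival container for the previous version.
        cold = {fp: data for fp, data in self.active.items() if fp not in seen}
        if cold:
            self.archival[self.version - 1] = cold
            for fp in cold:
                del self.active[fp]
        return recipe

    def restore(self, recipe):
        """Rebuild a backup from its recipe, checking hot chunks first."""
        out = []
        for fp in recipe:
            if fp in self.active:
                out.append(self.active[fp])
                continue
            for container in self.archival.values():
                if fp in container:
                    out.append(container[fp])
                    break
            else:
                raise KeyError(fp)
        return b"".join(out)

    def expire_up_to(self, version):
        """Expire versions 1..version by deleting their archival containers."""
        for v in [v for v in self.archival if v <= version]:
            del self.archival[v]


if __name__ == "__main__":
    store = ToyHotColdStore()
    r1 = store.backup([b"a", b"b", b"c"])
    r2 = store.backup([b"a", b"b", b"d"])
    print(store.restore(r1), store.restore(r2))
    store.expire_up_to(1)
    print(store.restore(r2))
```

In this toy, the active set always equals the chunk set of the newest version, so restoring the latest backup never touches an archival container, and expiring the oldest versions never requires checking which chunks are still referenced.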

Author information

Authors and Affiliations

Wuhan National Laboratory for Optoelectronics, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
Peng-Fei Li (李鹏飞), Yu Hua (华宇) & Qin Cao (曹钦)

Corresponding author

Correspondence to Yu Hua (华宇).

Ethics declarations

Conflict of Interest: The authors declare that they have no conflict of interest.

Additional information

A preliminary version of the paper was published in the Proceedings of ACM/IFIP Middleware 2020.

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62125202 and U22B2022.

Peng-Fei Li received his B.S. degree in computer science and technology from Huazhong University of Science and Technology, Wuhan, in 2017. He is currently a Ph.D. candidate majoring in computer system architecture at Huazhong University of Science and Technology, Wuhan. His research interests include in-memory indexes, network-attached key-value stores, and deduplication techniques.

Yu Hua received his B.S. and Ph.D. degrees in computer science from Wuhan University, Wuhan, in 2001 and 2005, respectively. He is currently a professor at Huazhong University of Science and Technology, Wuhan. His research interests include cloud storage systems, file systems, non-volatile memory architectures, etc.

Qin Cao received her B.S. degree in computer science from Central China Normal University, Wuhan, in 2017, and her Master's degree in computer science and technology from Huazhong University of Science and Technology, Wuhan, in 2020. Her research interests include data deduplication techniques and persistent memory systems.