Abstract
An abundance of data have been generated from various embedded devices, applications, and systems, and require cost-efficient storage services. Data deduplication removes duplicate chunks and becomes an important technique for storage systems to improve space efficiency. However, stored unique chunks are heavily fragmented, decreasing restore performance and incurs high overheads for garbage collection. Existing schemes fail to achieve an efficient trade-off among deduplication, restore and garbage collection performance, due to failing to explore and exploit the physical locality of different chunks. In this paper, we trace the storage patterns of the fragmented chunks in backup systems, and propose a high-performance deduplication system, called HiDeStore. The main insight is to enhance the physical-locality for the new backup versions during the deduplication phase, which identifies and stores hot chunks in the active containers. The chunks not appearing in new backups become cold and are gathered together in the archival containers. Moreover, we remove the expired data with an isolated container deletion scheme, avoiding the high overheads for expired data detection. Compared with state-of-the-art schemes, HiDeStore improves the deduplication and restore performance by up to 1.4x and 1.6x, respectively, without decreasing the deduplication ratios and incurring high garbage collection overheads.
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Khorasani S O, Rellermeyer J S, Epema D. Self-adaptive executors for big data processing. In Proc. the 20th International Middleware Conference, Dec. 2019, pp.176–188. DOI: https://doi.org/10.1145/3361525.3361545.
Birke R, Rocha I, Perez J, Schiavoni V, Felber P, Chen L Y. Differential approximation and sprinting for multi-priority big data engines. In Proc. the 20th International Middleware Conference, Dec. 2019, pp.202–214. DOI: https://doi.org/10.1145/3361525.3361547.
Akbari A, Martinez J, Jafari R. Facilitating human activity data annotation via context-aware change detection on smartwatches. ACM Trans. Embedded Computing Systems, 2021, 20(2): 15. DOI: https://doi.org/10.1145/3431503.
Fu M, Feng D, Hua Y, He X, Chen Z, Xia W, Zhang Y, Tan Y. Design tradeoffs for data deduplication performance in backup workloads. In Proc. the 13th USENIX Conference on File and Storage Technologies, Feb. 2015, pp.331–344.
Li Y K, Xu M, Ng C H, Lee P P C. Efficient hybrid in-line and out-of-line deduplication for backup storage. ACM Trans. Storage, 2015, 11(1): Article No. 2. DOI: https://doi.org/10.1145/2641572.
Park D, Fan Z, Nam Y J, Du D H C. A lookahead read cache: Improving read performance for deduplication backup storage. Journal of Computer Science and Technology, 2017, 32(1): 26–40. DOI: https://doi.org/10.1007/s11390-017-1680-8.
Duggal A, Jenkins F, Shilane P, Chinthekindi R, Shah R, Kamat M. Data domain cloud tier: Backup here, backup there, deduplicated everywhere! In Proc. the 2019 USENIX Annual Technical Conference, Jul. 2019, pp.647–660.
Meyer D T, Bolosky W J. A study of practical deduplication. ACM Trans. Storage, 2012, 7(4): Article No. 14. DOI: https://doi.org/10.1145/2078861.2078864.
Muthitacharoen A, Chen B, Mazières D. A low-bandwidth network file system. In Proc. the 18th ACM Symposium on Operating Systems Principles, Oct. 2001, pp.174–187. DOI: https://doi.org/10.1145/502034.502052.
Wallace G, Douglis F, Qian H, Shilane P, Smaldone S, Chamness M, Hsu W. Characteristics of backup workloads in production systems. In Proc. the 10th USENIX Conference on File and Storage Technologies, Feb. 2012, p.4.
Yang Q, Jin R, Zhao M. SmartDedup: Optimizing deduplication for resource-constrained devices. In Proc. the 2019 USENIX Annual Technical Conference, Jul. 2019, pp.633–646.
Quinlan S, Dorward S. Venti: A new approach to archival storage. In Proc. the FAST 2002 Conference on File and Storage Technologies, Jan. 2002, pp.89–101.
Zhu B, Li K, Patterson R H. Avoiding the disk bottleneck in the data domain deduplication file system. In Proc. the 6th USENIX Conference on File and Storage Technologies, Feb. 2008, pp.269–282.
Lillibridge M, Eshghi K, Bhagwat D, Deolalikar V, Trezis G, Camble P. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proc. the 7th USENIX Conference on File and Storage Technologies, Feb. 2009, pp.111–123.
Fu M, Feng D, Hua Y, He X, Chen Z, Xia W, Huang F, Liu Q. Accelerating restore and garbage collection in deduplication-based backup systems via exploiting historical information. In Proc. the 2014 USENIX Annual Technical Conference, Jun. 2014, pp.181–192.
Kaczmarczyk M, Barczynski M, Kilian W, Dubnicki C. Reducing impact of data fragmentation caused by in-line deduplication. In Proc. the 5th Annual International Systems and Storage Conference, Jun. 2012, Article No. 15. DOI: https://doi.org/10.1145/2367589.2367600.
Lillibridge M, Eshghi K, Bhagwat D. Improving restore speed for backup systems that use inline chunk-based deduplication. In Proc. the 11th USENIX conference on File and Storage Technologies, Feb. 2013, pp.183–198.
Cao Z, Wen H, Wu F, Du D H C. ALACC: Accelerating restore performance of data deduplication systems using adaptive look-ahead window assisted chunk caching. In Proc. the 16th USENIX Conference on File and Storage Technologies, Feb. 2018, pp.309–324.
Mao B, Jiang H, Wu S, Fu Y, Tian L. SAR: SSD assisted restore optimization for deduplication-based storage systems in the cloud. In Proc. the 7th IEEE International Conference on Networking, Architecture, Jun. 2012, pp.328–337. DOI: https://doi.org/10.1109/NAS.2012.48.
Nam Y J, Park D, Du D H C. Assuring demanded read performance of data deduplication storage with backup datasets. In Proc. the 20th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Aug. 2012, pp.201–208. DOI: https://doi.org/10.1109/MASCOTS.2012.32.
Cao Z, Liu S, Wu F, Wang G, Li B, Du D H C. Sliding look-back window assisted data chunk rewriting for improving deduplication restore performance. In Proc. the 17th USENIX Conference on File and Storage Technologies, Feb. 2019, pp.129–142.
Nam Y, Lu G, Park N, Xiao W, Du D H C. Chunk fragmentation level: An effective indicator for read performance degradation in deduplication storage. In Proc. the 13th IEEE International Conference on High Performance Computing and Communications, Sept. 2011, pp.581–586. DOI: https://doi.org/10.1109/HPCC.2011.82.
Ng C H, Lee P P C. RevDedup: A reverse deduplication storage system optimized for reads to latest backups. In Proc. the 4th Asia-Pacific Workshop on Systems, Jul. 2013, Article No. 15. DOI: https://doi.org/10.1145/2500727.2500731.
Li P, Hua Y, Cao Q, Zhang M. Improving the restore performance via physical-locality middleware for backup systems. In Proc. the 21st International Middleware Conference, Dec. 2020, pp.341–355. DOI: https://doi.org/10.1145/3423211.3425691.
Debnath B K, Sengupta S, Li J. ChunkStash: Speeding up inline storage deduplication using flash memory. In Proc. the 2010 USENIX Annual Technical Conference, Jun. 2010, Article No. 16.
Meister D, Kaiser J, Brinkmann A. Block locality caching for data deduplication. In Proc. the 6th International Systems and Storage Conference, Jul. 2013, Article No. 15. DOI: https://doi.org/10.1145/2485732.2485748.
Eshghi K, Tang H K. A framework for analyzing and improving content-based chunking algorithms. Technical Report, HP Laboratory, 2005. http://shiftleft.com/mirrors/www.hpl.hp.com/techreports/2005/HPL-2005-30R1.pdf. Oct. 2024.
Xia W, Zhou Y, Jiang H, Feng D, Hua Y, Hu Y, Liu Q, Zhang Y. Fastcdc: A fast and efficient content-defined chunking approach for data deduplication. In Proc. the 2016 USENIX Annual Technical Conference, Jun. 2016, pp.101–114.
Bhagwat D, Eshghi K, Long D D E, Lillibridge M. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proc. the 17th IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems, Sept. 2009. DOI: https://doi.org/10.1109/MASCOT.2009.5366623.
Xia W, Jiang H, Feng D, Hua Y. SiLo: A similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput. In Proc. the 2011 USENIX Conference on USENIX Annual Technical Conference, Jun. 2011.
Xu G, Tang B, Lu H, Yu Q, Sung C W. LIPA: A learning-based indexing and prefetching approach for data deduplication. In Proc. the 35th Symposium on Mass Storage Systems and Technologies, May 2019, pp.299–310. DOI: https://doi.org/10.1109/MSST.2019.00010.
Wei J, Jiang H, Zhou K, Feng D. MAD2: A scalable high-throughput exact deduplication approach for network backup services. In Proc. the 26th IEEE Symposium on Mass Storage Systems and Technologies, May 2010. DOI: https://doi.org/10.1109/MSST.2010.5496987.
Guo F, Efstathopoulos P. Building a high-performance deduplication system. In Proc. the 2011 USENIX Conference on USENIX Annual Technical Conference, Jun. 2011.
Zhang Y, Jiang H, Feng D, Xia W, Fu M, Huang F, Zhou Y. AE: An asymmetric extremum content defined chunking algorithm for fast and bandwidth-efficient data deduplication. In Proc. the 2015 IEEE Conference on Computer Communications, Apr. 26–May 1, 2015, pp.1337–1345. DOI: https://doi.org/10.1109/INFOCOM.2015.7218510.
Xia W, Jiang H, Feng D, Hua Y. Similarity and locality based indexing for high performance data deduplication. IEEE Trans. Computers, 2015, 64(4): 1162–1176. DOI: https://doi.org/10.1109/TC.2014.2308181.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interest The authors declare that they have no conflict of interest.
Additional information
A preliminary version of the paper was published in the Proceedings of ACM/IFIP Middleware 2020.
This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62125202 and U22B2022.
Peng-Fei Li received his B.S. degree in computer science and technology from Huazhong University of Science and Technology, Wuhan, in 2017. He is currently a Ph.D. candidate majoring in computer system architecture at Huazhong University of Science and Technology, Wuhan. His research interests include in-memory indexes, network-attached key-value stores, and deduplication techniques.
Yu Hua received his B.S. and Ph.D. degrees in computer science from Wuhan University, Wuhan, in 2001 and 2005, respectively. He is currently a professor at Huazhong University of Science and Technology, Wuhan. His research interests include cloud storage systems, file systems, non-volatile memory architectures, etc.
Qin Cao received her B.S. degree in computer science from Central China Normal University, Wuhan, in 2017, and her Master degree in computer science and technology from Huazhong University of Science and Technology, Wuhan, in 2020. Her research interests include data deduplication techniques and persistent memory systems.
Electronic Supplementary Material
Rights and permissions
About this article
Cite this article
Li, PF., Hua, Y. & Cao, Q. An Enhanced Physical-Locality Deduplication System for Space Efficiency. J. Comput. Sci. Technol. 39, 1361–1379 (2024). https://doi.org/10.1007/s11390-023-2646-7
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11390-023-2646-7