An Enhanced Physical-Locality Deduplication System for Space Efficiency

  • Regular Paper
  • Computer Architecture and Systems
Journal of Computer Science and Technology

Abstract

Abundant data are generated by various embedded devices, applications, and systems, and require cost-efficient storage services. Data deduplication removes duplicate chunks and has become an important technique for improving the space efficiency of storage systems. However, the stored unique chunks are heavily fragmented, which decreases restore performance and incurs high overheads for garbage collection. Existing schemes fail to achieve an efficient trade-off among deduplication, restore, and garbage collection performance because they do not explore and exploit the physical locality of different chunks. In this paper, we trace the storage patterns of the fragmented chunks in backup systems and propose a high-performance deduplication system called HiDeStore. The main insight is to enhance the physical locality of new backup versions during the deduplication phase by identifying hot chunks and storing them in active containers. Chunks that do not appear in new backups become cold and are gathered together in archival containers. Moreover, we remove expired data with an isolated container deletion scheme, which avoids the high overheads of expired data detection. Compared with state-of-the-art schemes, HiDeStore improves deduplication and restore performance by up to 1.4x and 1.6x, respectively, without decreasing the deduplication ratio or incurring high garbage collection overheads.
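
The hot/cold separation described in the abstract can be illustrated with a toy model. The sketch below is not HiDeStore's implementation: the class name `HotColdStore`, its methods, the use of SHA-1 fingerprints, and the rule that a chunk turns cold as soon as it is absent from a single new backup are all assumptions made for illustration. It only shows why grouping cold chunks by the last version that used them lets expired data be dropped one whole container at a time, with no per-chunk liveness check.

```python
from hashlib import sha1


class HotColdStore:
    """Toy model (hypothetical, not the paper's code): hot chunks stay in an
    active pool; chunks absent from the newest backup move to an archival
    container keyed by the last version that referenced them."""

    def __init__(self):
        self.active = {}    # fingerprint -> chunk bytes (hot chunks)
        self.archival = {}  # last referencing version -> {fingerprint: bytes}
        self.version = 0

    def backup(self, chunks):
        """Deduplicate one backup version; return how many new chunks were stored."""
        self.version += 1
        hot = {}
        stored = 0
        for chunk in chunks:
            fp = sha1(chunk).hexdigest()
            if fp in hot:
                continue                   # duplicate within this backup
            if fp in self.active:
                hot[fp] = self.active[fp]  # duplicate of a hot chunk
            else:
                hot[fp] = chunk            # unique chunk, store it
                stored += 1
        # Assumption: anything not seen in this version is now cold.
        cold = {fp: data for fp, data in self.active.items() if fp not in hot}
        if cold:
            self.archival[self.version - 1] = cold
        self.active = hot
        return stored

    def expire_through(self, version):
        """Expire backups up to `version` by deleting whole archival containers."""
        expired = [v for v in self.archival if v <= version]
        for v in expired:
            del self.archival[v]
        return len(expired)
```

In this model, backing up `[b"a", b"b", b"c"]` and then `[b"a", b"b", b"d"]` leaves `a`, `b`, and `d` in the active pool and moves `c` into the archival container for version 1. Calling `expire_through(1)` then frees that container directly, since no later version references its chunks.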

Author information

Corresponding author

Correspondence to Yu Hua (华 宇).

Ethics declarations

Conflict of Interest: The authors declare that they have no conflict of interest.

Additional information

A preliminary version of the paper was published in the Proceedings of ACM/IFIP Middleware 2020.

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62125202 and U22B2022.

Peng-Fei Li received his B.S. degree in computer science and technology from Huazhong University of Science and Technology, Wuhan, in 2017. He is currently a Ph.D. candidate majoring in computer system architecture at Huazhong University of Science and Technology, Wuhan. His research interests include in-memory indexes, network-attached key-value stores, and deduplication techniques.

Yu Hua received his B.S. and Ph.D. degrees in computer science from Wuhan University, Wuhan, in 2001 and 2005, respectively. He is currently a professor at Huazhong University of Science and Technology, Wuhan. His research interests include cloud storage systems, file systems, non-volatile memory architectures, etc.

Qin Cao received her B.S. degree in computer science from Central China Normal University, Wuhan, in 2017, and her master's degree in computer science and technology from Huazhong University of Science and Technology, Wuhan, in 2020. Her research interests include data deduplication techniques and persistent memory systems.

About this article

Cite this article

Li, PF., Hua, Y. & Cao, Q. An Enhanced Physical-Locality Deduplication System for Space Efficiency. J. Comput. Sci. Technol. 39, 1361–1379 (2024). https://doi.org/10.1007/s11390-023-2646-7

  • DOI: https://doi.org/10.1007/s11390-023-2646-7

Keywords