DCDedupe: Selective Deduplication and Delta Compression with Effective Routing for Distributed Storage

Journal of Grid Computing

Abstract

Data deduplication has become an essential part of storage systems for big data. Traditional compare-by-hash (CBH) deduplication does not fully address the challenge posed by similar files that differ only by small changes. Delta compression can complement it to further improve storage efficiency. In this paper, we design and implement a distributed storage system called DCDedupe that efficiently and intelligently applies either delta compression or deduplication, based on the characteristics of the data, to improve storage efficiency. Unlike the systems in prior studies, DCDedupe works well when data locality is weak or barely exists. DCDedupe adds a pre-processing step that identifies content similarity and classifies data chunks into different categories. An appropriate routing algorithm then sends each data chunk to the right target storage node in the distributed system to boost storage efficiency. Our evaluation shows that the storage space saved by DCDedupe generally outweighs the performance penalties. In some use cases, it may be worthwhile to trade some throughput for lower storage costs with DCDedupe. The overheads on input/output (I/O) operations and memory usage are also studied, and design recommendations are given.
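The classify-then-route idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify DCDedupe's similarity detector or routing algorithm, so everything here is an assumption made for illustration: SHA-1 fingerprints for exact matches, a min-hash style feature over sliding windows for similarity, modulo placement for new chunks, and the names `Router`, `similarity_feature`, and `NUM_NODES`.

```python
import hashlib
from dataclasses import dataclass, field

NUM_NODES = 4  # hypothetical cluster size, not taken from the paper


def fingerprint(chunk: bytes) -> str:
    """Exact-match fingerprint, as used in compare-by-hash deduplication."""
    return hashlib.sha1(chunk).hexdigest()


def similarity_feature(chunk: bytes, window: int = 8) -> int:
    """Min-hash style feature: the smallest hash over all sliding windows.

    Chunks that share most of their content are likely to share this value.
    This is an assumed stand-in for the paper's similarity detection.
    """
    if len(chunk) < window:
        return int.from_bytes(hashlib.sha1(chunk).digest()[:8], "big")
    return min(
        int.from_bytes(hashlib.sha1(chunk[i:i + window]).digest()[:8], "big")
        for i in range(len(chunk) - window + 1)
    )


@dataclass
class Router:
    """Classifies each chunk and picks a target storage node for it."""
    num_nodes: int = NUM_NODES
    exact: dict = field(default_factory=dict)    # fingerprint -> node
    similar: dict = field(default_factory=dict)  # similarity feature -> node

    def route(self, chunk: bytes):
        fp = fingerprint(chunk)
        if fp in self.exact:
            # Identical chunk already stored: deduplicate on the same node.
            return "duplicate", self.exact[fp]
        feat = similarity_feature(chunk)
        if feat in self.similar:
            # Similar chunk exists: co-locate so a delta can be stored
            # against the base chunk on that node.
            node = self.similar[feat]
            self.exact[fp] = node
            return "similar", node
        # Nothing related is known: place the chunk by its fingerprint.
        node = int(fp, 16) % self.num_nodes
        self.exact[fp] = node
        self.similar[feat] = node
        return "unique", node
```

Calling `Router().route(chunk)` returns a `(category, node)` pair. Sending duplicate and similar chunks to the node that already holds the related data is what lets deduplication or delta compression happen locally on that node, without relying on the order in which chunks arrive.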

Author information

Correspondence to Binqi Zhang.

Cite this article

Zhang, B., Wang, C., Zhou, B.B. et al. DCDedupe: Selective Deduplication and Delta Compression with Effective Routing for Distributed Storage. J Grid Computing 16, 195–209 (2018). https://doi.org/10.1007/s10723-018-9429-3

  • DOI: https://doi.org/10.1007/s10723-018-9429-3
