GLE-Dedup: A Globally–Locally Even Deduplication by Request-Aware Placement for Better Read Performance

International Journal of Parallel Programming

Abstract

Deduplication is a fundamental way to eliminate replicas and save space and network bandwidth in various storage systems. However, the performance of most existing deduplication systems can be further improved on normal reads, which carry crucial weight in the currently popular WORM access model. Specifically, most existing deduplication systems achieve a globally even layout via the simple round-robin algorithm and ignore the interrelationship between chunks and IO requests in the placement policy, thus failing to achieve locally even placement within a request and causing a read imbalance problem. In this paper, we focus on deduplication over small-scale storage systems with adequate bandwidth in between and propose GLE-Dedup, a deduplication system with a request-aware placement policy that achieves even placement both globally and locally for better read performance. Unlike conventional chunk-based placement, GLE-Dedup places chunks in groups, and the group size is mainly determined by the request ID to achieve request-awareness. We place chunks belonging to the same IO request on different independent nodes as much as possible to achieve even placement locally within a request, while maintaining global balance through rotation among chunk groups. In this way, better parallelism is exploited for higher read performance. Experimental results under the real-world CAFTL trace show the effectiveness and advantage of GLE-Dedup over B-Dedup and R-Dedup, under round-robin and random placement respectively. For example, GLE-Dedup achieves about 18.9% and 24% read improvement compared with B-Dedup and R-Dedup, respectively.
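The abstract does not give the placement algorithm in detail, so the following is only a minimal sketch of the general idea, under stated assumptions: each IO request is modelled as a list of chunk fingerprints, chunks already stored are deduplicated, and each new chunk goes to the node that currently holds the fewest chunks of the same request, with a rotating start position to keep the global layout even. The function names (`place_request_aware`, `place_round_robin`, `max_node_load`) and the tie-breaking rule are hypothetical and are not taken from the paper.

```python
from collections import Counter


def place_round_robin(requests, num_nodes):
    """Baseline: each new unique chunk goes to the next node in turn."""
    location = {}  # chunk fingerprint -> node index
    for chunks in requests:
        for c in chunks:
            if c not in location:  # duplicates are not stored again
                location[c] = len(location) % num_nodes
    return location


def place_request_aware(requests, num_nodes):
    """Sketch (assumption, not the paper's algorithm): spread the chunks
    of one request over distinct nodes, rotating the start node."""
    location = {}  # chunk fingerprint -> node index
    start = 0      # rotating start position for global balance
    for chunks in requests:
        # Nodes already used by this request's deduplicated chunks.
        load = Counter(location[c] for c in set(chunks) if c in location)
        # New unique chunks of this request, in arrival order.
        new_chunks = [c for c in dict.fromkeys(chunks) if c not in location]
        for c in new_chunks:
            # Scan nodes in rotated order; min() keeps the first of the
            # least-loaded nodes, so ties follow the rotation.
            order = [(start + i) % num_nodes for i in range(num_nodes)]
            node = min(order, key=lambda n: load[n])
            location[c] = node
            load[node] += 1
            start = (node + 1) % num_nodes
    return location


def max_node_load(chunks, location):
    """Largest number of a request's distinct chunks on one node,
    used here as a simple proxy for read imbalance."""
    per_node = Counter(location[c] for c in set(chunks))
    return max(per_node.values())


requests = [["a", "b", "c", "d"], ["a", "e", "f", "g"]]
rr = place_round_robin(requests, 4)
gle = place_request_aware(requests, 4)
print(max_node_load(requests[1], rr), max_node_load(requests[1], gle))
```

In this toy input the second request shares chunk `a` with the first. Under plain round-robin, `a` and `e` land on the same node, so reading the second request needs two chunks from one node; the request-aware sketch steers `e` to another node, so every node serves at most one chunk of that request.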

Acknowledgments

The authors would like to thank all anonymous reviewers for their constructive and insightful suggestions to improve this paper.

Author information

Corresponding author

Correspondence to Mingzhu Deng.

Additional information

This work is supported by the National Natural Science Foundation of China under Grant Nos. 61202121 and 61300218, and by the National High Technology Research and Development 863 Program of China under Grant No. 2015AA015305.

About this article

Cite this article

Deng, M., Chen, W., Xiao, N. et al. GLE-Dedup: A Globally–Locally Even Deduplication by Request-Aware Placement for Better Read Performance. Int J Parallel Prog 45, 946–964 (2017). https://doi.org/10.1007/s10766-016-0450-5

  • DOI: https://doi.org/10.1007/s10766-016-0450-5
