G-Paradex: GPU-Based Parallel Indexing for Fast Data Deduplication

Lin, Bin; Liao, Xiangke; Li, Shanshan; Wang, Yufeng; Huang, He; Wen, Ling

doi:10.1007/978-3-642-45293-2_7


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8299)

Included in the following conference series:

International Workshop on Advanced Parallel Processing Technologies

Abstract

Deduplication technology is increasingly used to reduce storage cost. In practice, duplicate detection against a large on-disk index incurs significant, unavoidable overhead on write operations. Most existing deduplication methods perform single-pass processing and pay little attention to developing highly parallel methods for emerging parallel processors. In this paper, we present the design of G-Paradex, a novel deduplication framework that significantly reduces duplicate-detection time. G-Paradex organizes chunk fingerprints in a prefix tree and uses the GPU to search that tree in parallel, which enables fast deduplication. Exploiting the inherent chunk locality of the written data stream, we group consecutive chunks and insert the handprints extracted from each group into the prefix tree, which shrinks the index and reduces on-disk accesses. Our experimental evaluation on real-world datasets demonstrates that, compared with the traditional single-pass method, G-Paradex achieves a 2-4x speedup in duplicate detection.
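
The indexing idea in the abstract can be illustrated with a minimal CPU-only sketch. It is not the authors' implementation. Several details are assumptions made for illustration: SHA-1 as the fingerprint function, a handprint defined as the k smallest fingerprints of a group (the abstract does not define the term), the constants GROUP_SIZE and HANDPRINT_SIZE, and every function and class name. The lookup_batch method runs sequentially here and only marks where independent lookups could be distributed across GPU threads.

```python
import hashlib

GROUP_SIZE = 4      # chunks per group (hypothetical value)
HANDPRINT_SIZE = 2  # fingerprints sampled per group (hypothetical value)


def fingerprint(chunk: bytes) -> str:
    """SHA-1 hex digest used as the chunk fingerprint (an assumption)."""
    return hashlib.sha1(chunk).hexdigest()


def handprint(fingerprints):
    """Sample a group by keeping its smallest distinct fingerprints (assumed definition)."""
    return sorted(set(fingerprints))[:HANDPRINT_SIZE]


class PrefixTree:
    """Trie keyed on the hex digits of a fingerprint."""

    def __init__(self):
        self.root = {}

    def insert(self, fp, group_id):
        node = self.root
        for digit in fp:
            node = node.setdefault(digit, {})
        # Leaf payload: the group that owns this fingerprint.
        node["group"] = group_id

    def lookup(self, fp):
        node = self.root
        for digit in fp:
            node = node.get(digit)
            if node is None:
                return None
        return node.get("group")

    def lookup_batch(self, fps):
        # Each lookup is independent of the others, so a GPU version
        # could assign one fingerprint to each thread.
        return [self.lookup(fp) for fp in fps]


def index_stream(chunks, tree):
    """Group consecutive chunks and index only each group's handprint."""
    for start in range(0, len(chunks), GROUP_SIZE):
        group = chunks[start:start + GROUP_SIZE]
        fps = [fingerprint(c) for c in group]
        for fp in handprint(fps):
            tree.insert(fp, start // GROUP_SIZE)


chunks = [f"chunk-{i}".encode() for i in range(8)]
tree = PrefixTree()
index_stream(chunks, tree)

# Probe with the handprint of an incoming group that repeats the second group.
incoming = [fingerprint(c) for c in chunks[4:8]]
hits = tree.lookup_batch(handprint(incoming))
```

In this sketch the tree holds HANDPRINT_SIZE entries per group rather than one entry per chunk, which is the index-shrinking effect the abstract describes. A hit identifies a candidate group whose full fingerprint list would then be compared chunk by chunk.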

Author information

Authors and Affiliations

National University of Defense Technology, Changsha, China
Bin Lin, Xiangke Liao, Shanshan Li, He Huang & Ling Wen
Zhengzhou Municipal Supervisory Bureau for Quality and Technology, China
Yufeng Wang

Editor information

Editors and Affiliations

Institute of Computing Technology, State Key Laboratory of Computer Architecture, Chinese Academy of Sciences, No. 6 Kexueyuan South Road, Haidian District, 100190, Beijing, China
Chenggang Wu
Département d’Informatique, INRIA and École Normale Supérieure, 45 rue d’Ulm, 75005, Paris, France
Albert Cohen

Cite this paper

Lin, B., Liao, X., Li, S., Wang, Y., Huang, H., Wen, L. (2013). G-Paradex: GPU-Based Parallel Indexing for Fast Data Deduplication. In: Wu, C., Cohen, A. (eds) Advanced Parallel Processing Technologies. APPT 2013. Lecture Notes in Computer Science, vol 8299. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45293-2_7

DOI: https://doi.org/10.1007/978-3-642-45293-2_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45292-5
Online ISBN: 978-3-642-45293-2
eBook Packages: Computer Science, Computer Science (R0)
