GPU Acceleration of Set Similarity Joins

  • Conference paper
Database and Expert Systems Applications (Globe 2015, DEXA 2015)

Abstract

We propose a scheme for efficient set similarity joins on Graphics Processing Units (GPUs). Due to the rapid growth and diversification of data, there is an increasing demand for fast execution of set similarity joins in applications that range from data integration to plagiarism detection. To tackle this problem, our solution takes advantage of the massively parallel processing offered by GPUs. Additionally, we employ MinHash to estimate the Jaccard similarity between two sets. By exploiting the high parallelism of GPUs and the space efficiency provided by MinHash, we achieve high performance without sacrificing accuracy. Experimental results show that our proposed method is more than two orders of magnitude faster than a serial CPU implementation and 25 times faster than a parallel CPU implementation, while generating highly precise query results.
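The MinHash estimate of Jaccard similarity mentioned in the abstract can be illustrated with a minimal CPU-side sketch. This is not the authors' GPU implementation: the function names, the linear hash family, the modulus, and the signature length are all assumptions chosen for illustration. The idea is that the fraction of positions where two fixed-length signatures agree approximates the Jaccard similarity of the underlying sets.

```python
import random

# Assumption: a Mersenne prime modulus for a simple linear hash family.
PRIME = (1 << 61) - 1


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, params):
    """Keep, for each hash function, the minimum hash value over the set.

    `items` is a non-empty set of non-negative integers below PRIME.
    """
    return [min((a * x + b) % PRIME for x in items) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima coincide."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b|, for comparison."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    params = make_hash_params(128, seed=1)
    r = {1, 2, 3, 4, 5, 6}
    s = {4, 5, 6, 7, 8, 9}
    print("exact:   ", exact_jaccard(r, s))
    print("estimate:", estimate_jaccard(minhash_signature(r, params),
                                        minhash_signature(s, params)))
```

Each signature has a fixed length regardless of set size, which is the space efficiency the abstract refers to. Each signature entry is also computed independently of the others, so the per-hash minimum is a natural unit of parallel work.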

Notes

  1. http://archive.ics.uci.edu/ml/datasets/.

  2. http://trec.nist.gov/data/t9_filtering.html.

  3. http://fimi.ua.ac.be/data/.

Acknowledgments

We thank Neil Millar and the reviewers for their feedback. This research was partly supported by the Grant-in-Aid for Scientific Research (B) (#26280037) from the Japan Society for the Promotion of Science.

Author information

Corresponding author

Correspondence to Mateus S. H. Cruz.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Cruz, M.S.H., Kozawa, Y., Amagasa, T., Kitagawa, H. (2015). GPU Acceleration of Set Similarity Joins. In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds) Database and Expert Systems Applications. Globe 2015, DEXA 2015. Lecture Notes in Computer Science, vol 9261. Springer, Cham. https://doi.org/10.1007/978-3-319-22849-5_26

  • DOI: https://doi.org/10.1007/978-3-319-22849-5_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-22848-8

  • Online ISBN: 978-3-319-22849-5

  • eBook Packages: Computer Science, Computer Science (R0)
