Abstract
We propose a scheme for efficient set similarity joins on Graphics Processing Units (GPUs). With the rapid growth and diversification of data, there is increasing demand for fast set similarity joins in applications ranging from data integration to plagiarism detection. To address this demand, our solution exploits the massively parallel processing offered by GPUs. In addition, we employ MinHash to estimate the Jaccard similarity between two sets. By combining the high parallelism of GPUs with the space efficiency of MinHash, we achieve high performance without sacrificing accuracy. Experimental results show that the proposed method is more than two orders of magnitude faster than a serial CPU implementation and 25 times faster than a parallel CPU implementation, while producing highly precise query results.
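The MinHash estimate mentioned in the abstract can be illustrated with a minimal CPU-side sketch. It is not the paper's GPU implementation. The helper names (`jaccard`, `minhash_signature`, `estimate_jaccard`), the signature length `k`, and the use of salted BLAKE2b as a stand-in for random permutations are all assumptions made for illustration.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(token, i):
    """i-th hash of a token (assumption: salted BLAKE2b stands in for a random permutation)."""
    digest = hashlib.blake2b(
        repr(token).encode("utf-8"),
        digest_size=8,
        salt=i.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, k=200):
    """Signature of k values: the minimum hash of the set under each of k hash functions."""
    tokens = set(tokens)
    if not tokens:
        raise ValueError("cannot build a MinHash signature for an empty set")
    return [min(_hash(t, i) for t in tokens) for i in range(k)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = set(range(0, 200))
    b = set(range(100, 300))
    print("exact:   ", jaccard(a, b))
    print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

Each set is reduced to a fixed-length signature regardless of its size, which is the space efficiency the abstract refers to. Comparing two signatures is a position-wise equality count, an operation that should map naturally onto data-parallel hardware.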
Acknowledgments
We thank Neil Millar and the reviewers for their feedback. This research was partly supported by the Grant-in-Aid for Scientific Research (B) (#26280037) from the Japan Society for the Promotion of Science.
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Cruz, M.S.H., Kozawa, Y., Amagasa, T., Kitagawa, H. (2015). GPU Acceleration of Set Similarity Joins. In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds) Database and Expert Systems Applications. Globe 2015, DEXA 2015. Lecture Notes in Computer Science, vol 9261. Springer, Cham. https://doi.org/10.1007/978-3-319-22849-5_26
DOI: https://doi.org/10.1007/978-3-319-22849-5_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22848-8
Online ISBN: 978-3-319-22849-5
eBook Packages: Computer Science, Computer Science (R0)