An Approach for Progressive Set Similarity Join with GPU Accelerating

Yu, Lining; Nie, Tiezheng; Shen, Derong; Kou, Yue

doi:10.1007/978-3-030-60029-7_14

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12432)

Included in the following conference series:

International Conference on Web Information Systems and Applications

Abstract

Set similarity join (SSJoin) is an important operation for finding pairs of similar sets in a given database, and it plays a core role in data integration, data cleaning, and data mining. In contrast to traditional SSJoin methods, progressive SSJoin targets large datasets and aims to find as many similar pairs as possible within a limited running time, so it reports partial matching results as early as possible during processing. Moreover, recent research has shown that GPUs (Graphics Processing Units) can accelerate similarity operations. This paper explores progressive SSJoin algorithms and their acceleration with GPUs. We propose two progressive SSJoin methods, PSSJM and PBM. PSSJM uses an inverted index, while PBM relies on a counting Bloom filter and prefix filtering. In addition, we propose a GPU-based algorithm built on our progressive method to accelerate the computation. Comprehensive experiments on real-world datasets show that our methods produce higher-quality results than the traditional method under a limited time budget, and that the GPU implementation achieves high speedups over the CPU-based method.
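The abstract names an inverted index and prefix filtering but gives no algorithmic detail, so the following is only a minimal sketch of how those two generic techniques combine in a join that emits results incrementally. It is not PSSJM or PBM. Jaccard similarity, the function name `progressive_ssjoin`, and the shortest-sets-first processing order are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def progressive_ssjoin(records, threshold):
    """Yield (i, j, sim) for record pairs whose Jaccard similarity >= threshold.

    Hypothetical illustration, not the paper's method. Pairs are yielded as
    soon as they are verified, so a caller can stop early and keep the
    partial result.
    """
    # Global token order: rare tokens first, so prefixes are selective.
    freq = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    # Assumption: process shorter sets first (cheap to verify).
    order = sorted(range(len(records)), key=lambda i: len(ordered[i]))

    index = defaultdict(list)  # inverted index: token -> ids of indexed records
    eps = 1e-9                 # guards ceil() against floating-point noise

    for i in order:
        toks = ordered[i]
        n = len(toks)
        if n == 0:
            continue  # empty sets are skipped in this sketch

        # Prefix filter for Jaccard: two sets that reach the threshold must
        # share at least one token within their first n - ceil(t*n) + 1 tokens.
        prefix_len = n - math.ceil(threshold * n - eps) + 1
        prefix = toks[:prefix_len]

        # Probe the inverted index for candidates sharing a prefix token.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        current = set(toks)
        for j in sorted(candidates):
            other = ordered[j]
            # Length filter: a much shorter set cannot reach the threshold.
            if len(other) < threshold * n - eps:
                continue
            inter = len(current.intersection(other))
            sim = inter / (n + len(other) - inter)
            if sim >= threshold:
                yield (min(i, j), max(i, j), sim)

        # Index the current record only after probing, so it never matches itself.
        for tok in prefix:
            index[tok].append(i)
```

Because the function is a generator, a caller with a time budget can consume it in a loop and break when the budget runs out, keeping whatever pairs have been produced so far.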

Acknowledgments

This work is supported by the National Key R&D Program of China (2018YFB1003404) and the National Natural Science Foundation of China (61672142, U1811261).

Author information

Authors and Affiliations

College of Computer Science and Technology, Northeastern University, Shenyang, China
Lining Yu, Tiezheng Nie, Derong Shen & Yue Kou

Corresponding author

Correspondence to Lining Yu.

Editor information

Editors and Affiliations

Guangzhou University, Guangzhou, China
Guojun Wang
The University of New South Wales, Sydney, NSW, Australia
Xuemin Lin
Rensselaer Polytechnic Institute, Troy, NY, USA
James Hendler
Wuhan University, Wuhan, China
Wei Song
Hohai University, Nanjing, China
Zhuoming Xu
Fuzhou University, Fuzhou, China
Genggeng Liu

About this paper

Cite this paper

Yu, L., Nie, T., Shen, D., Kou, Y. (2020). An Approach for Progressive Set Similarity Join with GPU Accelerating. In: Wang, G., Lin, X., Hendler, J., Song, W., Xu, Z., Liu, G. (eds) Web Information Systems and Applications. WISA 2020. Lecture Notes in Computer Science, vol 12432. Springer, Cham. https://doi.org/10.1007/978-3-030-60029-7_14

DOI: https://doi.org/10.1007/978-3-030-60029-7_14
Published: 22 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60028-0
Online ISBN: 978-3-030-60029-7
eBook Packages: Computer Science, Computer Science (R0)
