
Leveraging set relations in exact and dynamic set similarity join

The VLDB Journal

Abstract

Set similarity join, which finds all similar pairs of sets from two collections of sets, is a fundamental problem with a wide range of applications. Existing work studies both exact and approximate set similarity join. In this paper, we focus on the exact set similarity join problem. Existing solutions for exact set similarity join follow a filtering-verification framework: the filtering phase scans indexes to generate a list of candidate pairs, and the verification phase reports the candidate pairs that are truly similar. Although much research has been conducted on this problem, set relations have not been well studied as a means of improving algorithm efficiency through computational cost sharing. We therefore explore set relations at different levels to reduce the overall computational cost. First, it has been shown that most of the computational time is spent in the filtering phase, which for existing solutions can be quadratic in the number of sets in the worst case. We thus explore index-level set relations to reduce the filtering cost while keeping the same filtering power. We achieve this by grouping related sets into blocks in the index and skipping useless index probes during joins. Second, we explore answer-level set relations to further improve the algorithm, based on the intuition that if two sets are similar, their answers may overlap heavily. We derive an algorithm that incrementally generates the answer of one set from the already computed answer of another similar set, rather than computing the answer from scratch. In addition, because data in real applications are usually updated dynamically, we extend our techniques and design efficient algorithms to incrementally update the join result when any element in the sets is updated. Finally, we conduct extensive performance studies using 21 real datasets with various data properties from a wide range of domains. The experimental results demonstrate that our algorithm outperforms all the existing algorithms across all datasets.
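
To make the filtering-verification framework described above concrete, the following is a minimal sketch of a generic prefix-filter self-join. It is not the algorithm proposed in this article: it assumes Jaccard similarity (the abstract does not name a similarity function), and the names `filter_verify_join`, `prefix_length`, and `jaccard` are hypothetical, chosen only for illustration. The filtering phase probes an inverted index over set prefixes to collect candidates, and the verification phase computes the exact similarity of each candidate pair.

```python
from collections import defaultdict
from math import ceil


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def prefix_length(size, threshold):
    # Two sets with Jaccard >= threshold must share at least one token
    # within their prefixes of this length (under one global token order).
    # The small epsilon guards against floating-point round-up in ceil.
    return size - ceil(threshold * size - 1e-9) + 1


def filter_verify_join(sets, threshold):
    """Return all pairs (j, i), j < i, whose Jaccard similarity >= threshold."""
    sets = [frozenset(s) for s in sets]

    # Global token order: rare tokens first, so prefixes are selective.
    freq = defaultdict(int)
    for s in sets:
        for tok in s:
            freq[tok] += 1

    index = defaultdict(list)  # token -> ids of sets with the token in their prefix
    answers = []
    for i, s in enumerate(sets):
        ordered = sorted(s, key=lambda t: (freq[t], t))
        prefix = ordered[:prefix_length(len(s), threshold)]

        # Filtering phase: probe the index to collect candidate sets.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        # Verification phase: keep only candidates that are truly similar.
        for j in sorted(candidates):
            if jaccard(sets[j], s) >= threshold:
                answers.append((j, i))

        # Index the current set so later sets can find it.
        for tok in prefix:
            index[tok].append(i)
    return answers
```

For example, `filter_verify_join([{1, 2, 3, 4}, {1, 2, 3, 5}, {6, 7, 8}], 0.6)` verifies only the one pair that shares a prefix token instead of comparing all three pairs. The index probes in the filtering loop are the cost that, according to the abstract, dominates the running time and that the index-level techniques aim to reduce.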



Acknowledgements

Lu Qin is supported by ARC DE140100999 and DP160101513. Xuemin Lin is supported by NSFC61232006, ARC DP150102728, DP140103578, and DP170101628. Ying Zhang is supported by ARC DE140100679 and DP170103710. Lijun Chang is supported by ARC DE150100563 and ARC DP160101513.

Author information


Corresponding author

Correspondence to Lu Qin.


About this article


Cite this article

Wang, X., Qin, L., Lin, X. et al. Leveraging set relations in exact and dynamic set similarity join. The VLDB Journal 28, 267–292 (2019). https://doi.org/10.1007/s00778-018-0529-2


  • DOI: https://doi.org/10.1007/s00778-018-0529-2
