SETJoin: a novel top-k similarity join algorithm

Wang, Hongya; Yang, Lihong; Xiao, Yingyuan

doi:10.1007/s00500-020-04807-w

Methodologies and Application
Published: 06 March 2020

Volume 24, pages 14577–14592, (2020)

Soft Computing

Hongya Wang¹,
Lihong Yang¹ &
Yingyuan Xiao²

Abstract

Similarity joins are an important operation in data cleaning, near-duplicate Web page detection and data mining, and have received much attention recently. Existing similarity joins fall into two broad categories: the similarity-threshold-based similarity join and the top-k similarity join (TopkJoin). Compared with the threshold-based join, TopkJoin is more suitable for cases where the similarity threshold is unknown beforehand. In this paper, we focus on the performance optimization problem of TopkJoin. In particular, we observe that the state-of-the-art TopkJoin algorithm has three serious performance issues, i.e., the inappropriate application of the hash table, the inefficient use of suffix filtering and the unnecessary evaluation of excessive unqualified candidates. To resolve these problems, we propose a novel algorithm, SETJoin, which combines the existing event-driven framework with three simple yet efficient optimization techniques, viz., (1) reducing the cost of hashing by rearranging the order of the candidate filtering and hash table lookup operations; (2) maximizing the pruning capability of suffix filtering by judiciously choosing the (near) optimal recursion depth; and (3) terminating join operations earlier by setting a much tighter stop condition for the iteration. The experimental results show that SETJoin achieves a 1.26x–3.49x speedup over the state-of-the-art algorithm on several real datasets.
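
For readers unfamiliar with the event-driven framework the abstract builds on, the following is a minimal sketch of a generic top-k set similarity join, not SETJoin itself. It assumes Jaccard similarity and a simple prefix-based upper bound; both are assumptions made for illustration, since the abstract does not fix a similarity function. None of the three SETJoin optimizations (reordered hash lookups, tuned suffix filtering, tighter stop condition) is implemented here, and all function names are hypothetical.

```python
import heapq
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def topk_join_bruteforce(records, k):
    """Reference answer: score every pair and keep the k best."""
    scored = [(jaccard(set(records[i]), set(records[j])), i, j)
              for i, j in combinations(range(len(records)), 2)]
    return heapq.nlargest(k, scored)


def topk_join_events(records, k):
    """Event-driven top-k join with a prefix-based upper bound (illustrative)."""
    # Order tokens by ascending frequency so rare tokens are probed first.
    freq = defaultdict(int)
    for r in records:
        for tok in set(r):
            freq[tok] += 1
    sorted_recs = [sorted(set(r), key=lambda t: (freq[t], t)) for r in records]
    sets = [set(r) for r in sorted_recs]

    # Max-heap of prefix events: (negated bound, record id, token position).
    events = [(-1.0, rid, 0) for rid, r in enumerate(sorted_recs) if r]
    heapq.heapify(events)
    index = defaultdict(list)   # token -> ids of records already probed on it
    seen = set()                # hash table of pairs already verified
    best = []                   # min-heap holding the current top-k results

    while events:
        neg_bound, rid, pos = heapq.heappop(events)
        # Stop condition: no undiscovered pair can beat the current k-th result.
        if len(best) == k and -neg_bound <= best[0][0]:
            break
        tok = sorted_recs[rid][pos]
        for other in index[tok]:
            pair = (min(rid, other), max(rid, other))
            if pair in seen:
                continue
            seen.add(pair)
            item = (jaccard(sets[rid], sets[other]),) + pair
            if len(best) < k:
                heapq.heappush(best, item)
            elif item > best[0]:
                heapq.heapreplace(best, item)
        index[tok].append(rid)
        if pos + 1 < len(sorted_recs[rid]):
            # A pair whose first shared token sits at 0-based position p of
            # record x has Jaccard similarity at most 1 - p / |x|.
            bound = 1.0 - (pos + 1) / len(sorted_recs[rid])
            heapq.heappush(events, (-bound, rid, pos + 1))
    return sorted(best, reverse=True)
```

Both functions return `(similarity, i, j)` tuples in descending order of similarity. The event-driven version only discovers pairs that share at least one token, so it may return fewer than k results when fewer than k pairs overlap, and pairs tied at the k-th similarity may differ from the brute-force answer.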

Notes

Will be discussed in Sect. 5 in more detail.
For instance, during the execution of the top-500 query, over two hundred million candidate pairs are generated.
We do not present the details of prefix and positional filtering in Algorithm 1 for the sake of conciseness.
Please note that the suffixes of two records are passed to r and s when SuffixFilter is invoked.
ppjoin+ is the state-of-the-art SimJoin algorithm proposed in Xiao et al. (2008).
http://www.informatik.uni-trier.de/~ley/db.
http://trec.nist.gov/data/t9-filtering.html.
http://www.cs.cmu.edu/~enron.
Please note that the number of hash lookup operations is equal to the number of generated candidates in topk-join.

References

Arasu A, Ganti V, Kaushik R (2006) Efficient exact set-similarity joins. In: VLDB, pp 918–929
Arasu A, Chaudhuri S, Kaushik R (2008) Transformation-based framework for record matching. In: ICDE, pp 40–49
Baraglia R, Morales GDF, Lucchese C (2010) Document similarity self-join with mapreduce. In: Webb GI, Zhang C, Gunopulos D, Wu X (eds) ICDM. IEEE Computer Society, Washington, pp 731–736
Bayardo RJ, Ma Y, Srikant R (2007) Scaling up all pairs similarity search. In: WWW, pp 131–140
Behm A, Li C, Carey MJ (2011) Answering approximate string queries on large data sets using external memory. In: ICDE, pp 888–899
Broder AZ, Charikar M, Frieze AM, Mitzenmacher M (1998) Min-wise independent permutations (extended abstract). In: STOC, pp 327–336
Charikar M (2002) Similarity estimation techniques from rounding algorithms. In: STOC, pp 380–388
Chaudhuri S, Ganti V, Kaushik R (2006) A primitive operator for similarity joins in data cleaning. In: ICDE, p 5
Corral A, Manolopoulos Y, Theodoridis Y, Vassilakopoulos M (2000) Closest pair queries in spatial databases. In: SIGMOD, pp 189–200
Deng D, Li G, Hao S, Wang J, Feng J (2014) Massjoin: a mapreduce-based method for scalable string similarity joins. In: ICDE, pp 340–351
Fries S, Boden B, Stepien G, Seidl T (2014) Phidj: parallel similarity self-join for high-dimensional vector data with mapreduce. In: ICDE, pp 796–807
Gravano L, Ipeirotis PG, Jagadish HV, Koudas N, Muthukrishnan S, Srivastava D (2001) Approximate string joins in a database (almost) for free. In: VLDB, pp 491–500
Hernández MA, Stolfo SJ (1998) Real-world data is dirty: data cleansing and the merge/purge problem. Data Min Knowl Discov 2(1):9–37
Hu H, Li G, Bao Z, Feng J, Wu Y, Gong Z, Xu Y (2016) Top-k spatio-textual similarity join. IEEE Trans Knowl Data Eng 28(2):551–565
Huang J, Zhang R, Buyya R, Chen J (2014) MELODY-JOIN: efficient earth mover’s distance similarity joins using mapreduce. In: ICDE, pp 808–819
Jestes J, Li F, Yan Z, Yi K (2010) Probabilistic string similarity joins. In: SIGMOD, pp 327–338
Jiang Y, Li G, Feng J, Li W (2014) String similarity joins: an experimental evaluation. PVLDB 7(8):625–636
Kim Y, Shim K (2012) Parallel top-k similarity join algorithms using mapreduce. In: ICDE, pp 510–521
Lam HT, Dung DV, Perego R, Silvestri F (2010) An incremental prefix filtering approach for the all pairs similarity search problem. APWeb 2010:188–194
Li G, He J, Deng D, Li J (2015) Efficient similarity join and search on multi-attribute data. In: SIGMOD, pp 1137–1151
Mann W, Augsten N, Bouros P (2016) An empirical evaluation of set similarity join techniques. Proc VLDB Endow 9(9):636–647
Metwally A, Faloutsos C (2012) V-smart-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors. PVLDB 5(8):704–715
Quirino RD, Ribeiro-Junior S, Ribeiro LA, Martins WS (2018) Efficient filter-based algorithms for exact set similarity join on GPUs. In: Hammoudi S, Śmiałek M, Camp O, Filipe J (eds) Enterprise information systems. ICEIS 2017. Lecture notes in business information processing, vol 321. Springer, Cham, pp 74–95
Sarawagi S, Kirpal A (2004) Efficient set joins on similarity predicates. In: SIGMOD, pp 743–754
Sarma AD, He Y, Chaudhuri S (2014) Clusterjoin: a similarity joins framework using map-reduce. PVLDB 7(12):1059–1070
SriUsha I, Choudary KR, Sasikala T et al (2018) Data mining techniques used in the recommendation of e-commerce services. In: Second international conference on electronics, communication and aerospace technology (ICECA). IEEE, pp 379–382
Vernica R, Carey MJ, Li C (2010) Efficient parallel set-similarity joins using mapreduce. In: SIGMOD, pp 495–506
Wang J, Li G, Feng J (2012) Can we beat the prefix filtering? An adaptive framework for similarity join and search. In: SIGMOD, pp 85–96
Wang X, Qin L, Lin X, Zhang Y, Chang L (2017) Leveraging set relations in exact set similarity join. Proc VLDB Endow 10(9):925–936
Willi M, Augsten N, Jensen CS (2017) Swoop: top-k similarity joins over set streams. arXiv: Databases
Xiao C, Wang W, Lin X, Yu JX (2008) Efficient similarity joins for near duplicate detection. In: WWW, pp 131–140
Xiao C, Wang W, Lin X, Shang H (2009) Top-k set similarity joins. In: ICDE, pp 916–927
Xiong Y, Zhu Y, Yu PS (2015) Top-k similarity join in heterogeneous information networks. IEEE Trans Knowl Data Eng 27(6):1710–1723
Zhu M, Papadias D, Zhang J, Lee DL (2005) Top-k spatial joins. IEEE Trans Knowl Data Eng 17(4):567–579

Acknowledgements

The work reported in this paper is partially supported by NSFC under Grant Number 61370205, NSF of Shanghai under Grant Number 13ZR1400800 and the Fundamental Research Funds for the Central Universities.

Author information

Authors and Affiliations

School of Computer Science and Technology, Donghua University, Shanghai, China
Hongya Wang & Lihong Yang
School of Computer Science and Technology, Tianjin University of Technology, Tianjin, China
Yingyuan Xiao

Corresponding author

Correspondence to Hongya Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Human participants or animals rights

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, H., Yang, L. & Xiao, Y. SETJoin: a novel top-k similarity join algorithm. Soft Comput 24, 14577–14592 (2020). https://doi.org/10.1007/s00500-020-04807-w

Published: 06 March 2020
Issue Date: October 2020
DOI: https://doi.org/10.1007/s00500-020-04807-w
