Efficient Set Similarity Joins Using Min-prefixes

Ribeiro, Leonardo A.; Härder, Theo

doi:10.1007/978-3-642-03973-7_8


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5739)

Included in the following conference series:

East European Conference on Advances in Databases and Information Systems


Abstract

Identifying all objects in a dataset whose similarity is not less than a specified threshold is of major importance for the management, search, and analysis of data. Set similarity joins are commonly used to implement this operation; they scale to large datasets and are versatile enough to represent a variety of similarity notions. At a high level of abstraction, most set similarity join methods proposed so far have two main phases: candidate generation, which produces a set of candidate pairs, and verification, which applies the actual similarity measure to the candidates and returns the correct answer. Previous work has primarily focused on reducing the number of candidates, with most of the effort spent in candidate generation to obtain better pruning. Here, we propose the opposite approach. We drastically decrease the computational cost of candidate generation by dynamically reducing the number of indexed objects, at the expense of increasing the workload of the verification phase. Our experimental findings show that this trade-off is advantageous: we consistently achieve substantial speed-ups compared to previous algorithms.
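The two-phase structure described in the abstract can be illustrated with a minimal sketch. It is not the min-prefix algorithm of this paper, whose details are not given on this page; it is a generic prefix-filtering join under Jaccard similarity, in the spirit of the earlier work the abstract refers to. All names (`jaccard`, `prefix_length`, `similarity_join`) and the choice of Jaccard are assumptions made for illustration only.

```python
from collections import defaultdict
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two token collections (assumed measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_length(size, threshold):
    # Standard prefix-filter bound for Jaccard: two sets that reach the
    # threshold must share a token within their first
    # size - ceil(threshold * size) + 1 tokens under one global order.
    # The small epsilon guards against floating-point overshoot in ceil.
    return size - ceil(threshold * size - 1e-9) + 1


def similarity_join(records, threshold):
    """Return (i, j, sim) for all i < j with jaccard >= threshold."""
    # Global token order: rare tokens first, ties broken by token value.
    freq = defaultdict(int)
    for rec in records:
        for tok in set(rec):
            freq[tok] += 1
    ordered = [sorted(set(r), key=lambda t: (freq[t], t)) for r in records]

    index = defaultdict(list)  # token -> ids of records indexed so far
    results = []
    for rid, toks in enumerate(ordered):
        if not toks:
            continue
        prefix = toks[:prefix_length(len(toks), threshold)]

        # Phase 1: candidate generation by probing the inverted index.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        # Phase 2: verification with the actual similarity measure.
        for cid in candidates:
            sim = jaccard(toks, ordered[cid])
            if sim >= threshold:
                results.append((cid, rid, sim))

        # Index the current record's prefix for later probes.
        for tok in prefix:
            index[tok].append(rid)
    return results


if __name__ == "__main__":
    data = [["a", "b", "c", "d"], ["a", "b", "c", "e"],
            ["x", "y", "z"], ["a", "b", "c", "d"]]
    for i, j, sim in sorted(similarity_join(data, 0.5)):
        print(i, j, round(sim, 2))
```

In this sketch every record's full prefix is indexed. The trade-off the abstract describes amounts to indexing less and accepting more verification work; how the authors achieve that is not shown here.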



Author information

Authors and Affiliations

AG DBIS, Department of Computer Science, University of Kaiserslautern, Germany
Leonardo A. Ribeiro & Theo Härder


Editor information

Editors and Affiliations

Institute of Applied Computer Systems, Riga Technical University, Kalku iela 1, LV 1658, Riga, Latvia
Janis Grundspenkis
Institute of Computing Science, University of Technology, Piotrowo 2, 60-965, Poznań, Poland
Tadeusz Morzy
European Research Center for Information Systems, University of Münster, Leonardo Campus 3, 48149, Münster, Germany
Gottfried Vossen


About this paper

Cite this paper

Ribeiro, L.A., Härder, T. (2009). Efficient Set Similarity Joins Using Min-prefixes. In: Grundspenkis, J., Morzy, T., Vossen, G. (eds) Advances in Databases and Information Systems. ADBIS 2009. Lecture Notes in Computer Science, vol 5739. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03973-7_8


DOI: https://doi.org/10.1007/978-3-642-03973-7_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03972-0
Online ISBN: 978-3-642-03973-7
eBook Packages: Computer Science, Computer Science (R0)
