Abstract
Identifying all pairs of objects in a dataset whose similarity is at least a specified threshold is of major importance for data management, search, and analysis. Set similarity joins are commonly used to implement this operation; they scale to large datasets and are versatile enough to represent a variety of similarity notions. At a high level of abstraction, most set similarity join methods proposed so far consist of two main phases: candidate generation, which produces a set of candidate pairs, and verification, which applies the actual similarity measure to the candidates and returns the correct answer. Previous work has primarily focused on reducing the number of candidates, with most of the effort spent in candidate generation to obtain better pruning. Here, we propose the opposite approach. We drastically decrease the computational cost of candidate generation by dynamically reducing the number of indexed objects, at the expense of increasing the workload of the verification phase. Our experimental findings show that this trade-off is advantageous: we consistently achieve substantial speed-ups compared with previous algorithms.
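To make the two-phase structure concrete, the following is a minimal Python sketch of a generic set similarity join with a candidate generation phase and a verification phase. It is not the min-prefix algorithm proposed in this paper: it assumes Jaccard similarity and uses standard prefix filtering over an inverted index, and the names `similarity_join` and `jaccard` are hypothetical, chosen for illustration only.

```python
from collections import defaultdict
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def similarity_join(records, threshold):
    """Return (i, j, sim) for all record pairs with Jaccard >= threshold.

    Illustrative sketch only: generic prefix filtering, assumed here
    as a stand-in for the candidate generation phase.
    """
    # Global token order: rare tokens first, ties broken by token value.
    freq = defaultdict(int)
    for rec in records:
        for tok in rec:
            freq[tok] += 1
    ordered = [sorted(rec, key=lambda t: (freq[t], t)) for rec in records]

    index = defaultdict(list)  # token -> ids of records indexed under it
    result = []
    for rid, tokens in enumerate(ordered):
        # Prefix length for Jaccard; the epsilon guards against
        # floating-point round-up in threshold * len(tokens).
        prefix_len = len(tokens) - ceil(threshold * len(tokens) - 1e-9) + 1
        prefix = tokens[:prefix_len]

        # Phase 1: candidate generation via the inverted index.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        # Phase 2: verification with the actual similarity measure.
        for cid in candidates:
            sim = jaccard(set(tokens), set(ordered[cid]))
            if sim >= threshold:
                result.append((cid, rid, sim))

        # Index the current record so later records can find it.
        for tok in prefix:
            index[tok].append(rid)
    return result
```

In this sketch, every record is indexed under all of its prefix tokens. The trade-off described in the abstract would correspond to indexing less and verifying more, but how that is done is specific to the paper and is not reproduced here.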
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ribeiro, L.A., Härder, T. (2009). Efficient Set Similarity Joins Using Min-prefixes. In: Grundspenkis, J., Morzy, T., Vossen, G. (eds) Advances in Databases and Information Systems. ADBIS 2009. Lecture Notes in Computer Science, vol 5739. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03973-7_8
DOI: https://doi.org/10.1007/978-3-642-03973-7_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03972-0
Online ISBN: 978-3-642-03973-7
eBook Packages: Computer Science, Computer Science (R0)