Abstract
Given two sets of strings and a similarity function on strings, a similarity join finds all pairs of strings, one from each set, that are similar under that function. In this paper, we focus on similarity joins with respect to the edit distance. We propose a new metric, called the bounded occurrence edit distance, and a filter based on it. Because the metric can be computed with bitwise operations faster than the edit distance, the filter reduces the total time required to solve similarity joins. We demonstrate the effectiveness of the filter through experiments.
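The filter-then-verify idea behind such joins can be illustrated with a minimal sketch. The abstract does not define the bounded occurrence edit distance, so the sketch below does not implement it. As a stand-in it uses the bag distance, a well-known lower bound on the edit distance computed from character counts alone. The function names (`edit_distance`, `bag_distance`, `similarity_join`) and the threshold parameter `k` are assumptions made for this illustration.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Edit distance by the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def bag_distance(a: str, b: str) -> int:
    """Stand-in filter metric (NOT the paper's metric).

    Compares the multisets of characters; the result never exceeds
    the edit distance, so it is safe to use for pruning.
    """
    ca, cb = Counter(a), Counter(b)
    only_a = sum((ca - cb).values())  # characters surplus in a
    only_b = sum((cb - ca).values())  # characters surplus in b
    return max(only_a, only_b)


def similarity_join(left, right, k):
    """Return all pairs (x, y) with edit_distance(x, y) <= k."""
    result = []
    for x in left:
        for y in right:
            # Cheap filter: a lower bound above k rules the pair out.
            if bag_distance(x, y) > k:
                continue
            # Expensive verification only for surviving pairs.
            if edit_distance(x, y) <= k:
                result.append((x, y))
    return result
```

Because the filter value is a lower bound on the edit distance, pruning never discards a true match; it only avoids running the quadratic-time verification on pairs that cannot qualify. The paper's contribution, per the abstract, is a filter metric of this kind that is evaluated with bitwise operations.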
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Komatsu, T., Okuta, R., Narisawa, K., Shinohara, A. (2014). Bounded Occurrence Edit Distance: A New Metric for String Similarity Joins with Edit Distance Constraints. In: Geffert, V., Preneel, B., Rovan, B., Štuller, J., Tjoa, A.M. (eds) SOFSEM 2014: Theory and Practice of Computer Science. SOFSEM 2014. Lecture Notes in Computer Science, vol 8327. Springer, Cham. https://doi.org/10.1007/978-3-319-04298-5_32
DOI: https://doi.org/10.1007/978-3-319-04298-5_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-04297-8
Online ISBN: 978-3-319-04298-5
eBook Packages: Computer Science, Computer Science (R0)