Abstract
Efficient extraction of strings or sub-strings similar to an input query string is essential in applications such as instant search and record linkage, where the similarity between two strings is usually quantified by edit distance. This paper proposes MIST, a novel top-k approximate sub-string matching algorithm for a given query, based on the Chi-squared statistical significance of string triplets, thereby avoiding expensive edit distance computation. Experiments with real-life data validate the run-time efficiency and accuracy of the algorithm.
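For context, the edit distance that the abstract names as the usual similarity measure can be sketched as follows. This is the textbook Levenshtein dynamic program, not the MIST algorithm itself; the function name `edit_distance` is an illustrative assumption. It takes time proportional to the product of the two string lengths, which is the kind of cost the abstract describes as expensive.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program.

    Illustrative baseline only; this is NOT the MIST method.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible

    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

Running this pairwise against every candidate sub-string of a large text multiplies that quadratic cost by the number of candidates, which is why filtering or scoring schemes that avoid it are attractive.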
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Dutta, S. (2015). MIST: Top-k Approximate Sub-string Mining Using Triplet Statistical Significance. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds) Advances in Information Retrieval. ECIR 2015. Lecture Notes in Computer Science, vol 9022. Springer, Cham. https://doi.org/10.1007/978-3-319-16354-3_31
DOI: https://doi.org/10.1007/978-3-319-16354-3_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16353-6
Online ISBN: 978-3-319-16354-3
eBook Packages: Computer Science (R0)