MIST: Top-k Approximate Sub-string Mining Using Triplet Statistical Significance

Dutta, Sourav

doi:10.1007/978-3-319-16354-3_31

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9022)

Abstract

Efficiently extracting strings or sub-strings similar to an input query string is essential in applications such as instant search and record linkage, where the similarity between two strings is usually quantified by edit distance. This paper proposes MIST, a novel top-k approximate sub-string matching algorithm that, for a given query, relies on the Chi-squared statistical significance of string triplets and thereby avoids expensive edit distance computation. Experiments with real-life data validate the run-time effectiveness and accuracy of the algorithm.
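
To make the two quantities named in the abstract concrete, here is a minimal Python sketch of (a) the standard Levenshtein edit distance, whose quadratic dynamic program is the cost that is typically considered expensive, and (b) the generic Pearson Chi-squared statistic. This is not the MIST algorithm: the abstract does not say how string triplets are formed or scored, so the function names `edit_distance` and `pearson_chi_square`, and the idea of feeding observed and expected category counts into the statistic, are illustrative assumptions only.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic O(len(a) * len(b)) dynamic program."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def pearson_chi_square(observed: Sequence[float],
                       expected: Sequence[float]) -> float:
    """Generic Pearson statistic: sum over categories of (O - E)^2 / E.

    Assumption: the categories and expected counts are supplied by the caller;
    how MIST derives them from string triplets is not stated in the abstract.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

For instance, `edit_distance("kitten", "sitting")` evaluates to 3 after filling a 6 x 7 table, whereas the Chi-squared statistic needs only one pass over a fixed number of category counts. That difference in cost is the general motivation for replacing an edit distance computation with a significance score.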

Author information

Authors and Affiliations

Max-Planck Institute for Informatics, Germany
Sourav Dutta

Editor information

Editors and Affiliations

Vienna University of Technology, Institute of Software Technology and Interactive Systems, Favoritenstraße 9-11/188, 1040, Vienna, Austria
Allan Hanbury
Lumi, Semion Ltd., 111 Charterhouse Street, EC1M 6AW, London, UK
Gabriella Kazai
Institute of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstraße 9-11/188, 1040, Vienna, Austria
Andreas Rauber
Universität Duisburg-Essen, Lotharstraße 65, 47057, Duisburg, Germany
Norbert Fuhr

Cite this paper

Dutta, S. (2015). MIST: Top-k Approximate Sub-string Mining Using Triplet Statistical Significance. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds) Advances in Information Retrieval. ECIR 2015. Lecture Notes in Computer Science, vol 9022. Springer, Cham. https://doi.org/10.1007/978-3-319-16354-3_31

DOI: https://doi.org/10.1007/978-3-319-16354-3_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16353-6
Online ISBN: 978-3-319-16354-3
eBook Packages: Computer Science, Computer Science (R0)
