Fast Phonetic Similarity Search over Large Repositories

Tissot, Hegler; Peschl, Gabriel; Del Fabro, Marcos Didonet

doi:10.1007/978-3-319-10085-2_6


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8645)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

Analysis of unstructured data may be inefficient in the presence of spelling errors. Existing approaches use string similarity methods, supported by a dictionary, to search for valid words within a text. However, these methods are not rich enough to encode phonetic information that could assist the search. In this paper, we present a novel approach to efficiently perform phonetic similarity search over large data sources, using a data structure called PhoneticMap to encode language-specific phonetic information. We validate our approach through an experiment over a data set using a Portuguese variant of a well-known repository, automatically correcting words with spelling errors.
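
The abstract does not describe the internals of PhoneticMap, so the following is only a generic sketch of the underlying idea of phonetic lookup: dictionary words are bucketed by a phonetic key, and a misspelled word is corrected by looking up the words that share its key. The rewrite rules in `RULES` and the function names `phonetic_key`, `build_index`, and `suggest` are hypothetical illustrations, not the authors' rules or data structure.

```python
from collections import defaultdict

# Hypothetical, heavily simplified Portuguese-flavoured rewrite rules.
# They are applied in order, so "ch" and "ph" are handled before silent "h".
RULES = [("ss", "s"), ("ç", "s"), ("ch", "x"), ("ph", "f"), ("z", "s"), ("h", "")]


def phonetic_key(word):
    """Reduce a word to a rough phonetic key using the illustrative rules."""
    key = word.lower()
    for src, dst in RULES:
        key = key.replace(src, dst)
    # Collapse runs of the same letter into a single letter.
    collapsed = []
    for ch in key:
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    return "".join(collapsed)


def build_index(dictionary):
    """Group valid dictionary words by their phonetic key."""
    index = defaultdict(list)
    for word in dictionary:
        index[phonetic_key(word)].append(word)
    return index


def suggest(index, word):
    """Return dictionary words that share the phonetic key of `word`."""
    return index.get(phonetic_key(word), [])


index = build_index(["casa", "chave", "passo", "hora"])
print(suggest(index, "caza"))
print(suggest(index, "xave"))
```

Because each lookup is a single hash access on the key, the cost of a query does not grow with the size of the dictionary, which is the general reason key-based phonetic indexes scale to large repositories. Whether PhoneticMap works this way is an assumption.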

Author information

Authors and Affiliations

C3SL Labs, Federal University of Parana, Curitiba, Brazil
Hegler Tissot, Gabriel Peschl & Marcos Didonet Del Fabro

Editor information

Editors and Affiliations

Instituto Tecnológico de Informática, 46022, Valencia, Spain
Hendrik Decker
Faculty of Electrical Engineering, Department of Cybernetics, Czech Technical University in Prague, 166 27, Prague 6, Czech Republic
Lenka Lhotská
Department of Computer Science, The University of Auckland, 1010, Auckland, New Zealand
Sebastian Link
Knowledge Management, LMU University of Munich, Leopoldstraße 13, 80802, Munich, Germany
Marcus Spies
FAW, University of Linz, Altenbergerstrasse 69, 4040, Linz, Austria
Roland R. Wagner

About this paper

Cite this paper

Tissot, H., Peschl, G., Del Fabro, M.D. (2014). Fast Phonetic Similarity Search over Large Repositories. In: Decker, H., Lhotská, L., Link, S., Spies, M., Wagner, R.R. (eds) Database and Expert Systems Applications. DEXA 2014. Lecture Notes in Computer Science, vol 8645. Springer, Cham. https://doi.org/10.1007/978-3-319-10085-2_6

DOI: https://doi.org/10.1007/978-3-319-10085-2_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10084-5
Online ISBN: 978-3-319-10085-2
eBook Packages: Computer Science, Computer Science (R0)
