Similarity of Names Across Scripts: Edit Distance Using Learned Costs of N-Grams

Pouliquen, Bruno

doi:10.1007/978-3-540-85287-2_39

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5221)

Included in the following conference series:

International Conference on Natural Language Processing

Abstract

Any cross-language processing application must first tackle the problem of transliteration when it encounters a language written in another script. One solution is to use existing transliteration tools, but these are often unsuitable for a given purpose, and for some script pairs they do not exist at all. Our aim is to discriminate transliterations across different scripts in a unified way, using a learning method that builds a transliteration model from a set of transliterated proper names. We compare two strings with an algorithm that computes a Levenshtein edit distance using n-gram costs. Our evaluations show that the similarity measure is accurate.
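
The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of an edit distance driven by n-gram costs, not the chapter's actual method. It assumes a hypothetical cost table mapping pairs of short character blocks (source n-gram, target n-gram) to a cost; the function name `ngram_edit_distance`, the `max_n` parameter, and the example costs are all invented for illustration. In a learned setting the table would be estimated from transliterated name pairs rather than written by hand.

```python
from typing import Dict, Tuple

# Hypothetical cost table: (source n-gram, target n-gram) -> cost.
# The values used in the usage example below are made up, not learned.
Costs = Dict[Tuple[str, str], float]


def ngram_edit_distance(src: str, tgt: str, costs: Costs, max_n: int = 2) -> float:
    """Edit distance in which a block of up to max_n source characters can be
    rewritten as a block of up to max_n target characters at the cost given
    in `costs`. Pairs absent from the table fall back to the classic
    unit-cost Levenshtein operations (match 0, substitute/insert/delete 1)."""
    m, n = len(src), len(tgt)
    inf = float("inf")
    # d[i][j] = cheapest cost of rewriting src[:i] as tgt[:j]
    d = [[inf] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            base = d[i][j]
            # Try consuming a source characters and b target characters.
            for a in range(max_n + 1):
                if i + a > m:
                    break
                for b in range(max_n + 1):
                    if a == 0 and b == 0:
                        continue
                    if j + b > n:
                        break
                    s, t = src[i:i + a], tgt[j:j + b]
                    if (s, t) in costs:
                        step = costs[(s, t)]          # table-given n-gram cost
                    elif a == 1 and b == 1:
                        step = 0.0 if s == t else 1.0  # match / substitution
                    elif a + b == 1:
                        step = 1.0                     # insertion / deletion
                    else:
                        continue                       # unknown multi-char pair
                    if base + step < d[i + a][j + b]:
                        d[i + a][j + b] = base + step
    return d[m][n]


# Usage with an illustrative (invented) Cyrillic-to-Latin cost table.
example_costs: Costs = {
    ("П", "P"): 0.0, ("у", "u"): 0.0, ("ш", "sh"): 0.1,
    ("к", "k"): 0.0, ("и", "i"): 0.0, ("н", "n"): 0.0,
}
distance = ngram_edit_distance("Пушкин", "Pushkin", example_costs)
```

With an empty table the function reduces to the ordinary Levenshtein distance, which treats every cross-script character pair as a full-cost substitution; the table is what lets a one-to-many mapping such as "ш" to "sh" count as a cheap single operation.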

Author information

Authors and Affiliations

European Commission - Joint Research Centre, Via Enrico Fermi, 2749 21027 Ispra (VA), Italy
Bruno Pouliquen

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Chalmers University of Technology, 41296, Göteborg, Sweden
Bengt Nordström & Aarne Ranta

About this paper

Cite this paper

Pouliquen, B. (2008). Similarity of Names Across Scripts: Edit Distance Using Learned Costs of N-Grams. In: Nordström, B., Ranta, A. (eds) Advances in Natural Language Processing. GoTAL 2008. Lecture Notes in Computer Science, vol 5221. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85287-2_39

DOI: https://doi.org/10.1007/978-3-540-85287-2_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85286-5
Online ISBN: 978-3-540-85287-2
eBook Packages: Computer Science, Computer Science (R0)
