Abstract
Any cross-language processing application must first tackle transliteration when it encounters a language written in another script. One option is to use existing transliteration tools, but these are usually not suitable for all purposes, and for some script pairs they do not exist at all. Our aim is to discriminate transliterations across different scripts in a unified way, using a learning method that builds a transliteration model from a set of transliterated proper names. We compare two strings using an algorithm that computes a Levenshtein edit distance with n-gram costs. The evaluations carried out show that our similarity measure is accurate.
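To make the idea of an edit distance with n-gram costs concrete, the following is a minimal sketch and not the method of the chapter itself. It assumes a hypothetical cost table `costs` that maps (source n-gram, target n-gram) pairs to a cost; in the chapter such costs are learned from transliterated proper names, whereas here they are written by hand for illustration. The function name `ngram_edit_distance`, the parameter `max_n`, and the fallback unit costs are all assumptions.

```python
from typing import Dict, Tuple


def ngram_edit_distance(source: str, target: str,
                        costs: Dict[Tuple[str, str], float],
                        max_n: int = 2) -> float:
    """Edit distance where one operation rewrites a source n-gram
    (0..max_n characters) into a target n-gram (0..max_n characters).

    Costs come from `costs` (a hypothetical, hand-written table here).
    Pairs missing from the table fall back to classic Levenshtein
    behaviour: identical n-grams cost 0, and single-character insertions,
    deletions and substitutions cost 1.
    """
    inf = float("inf")
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[inf] * cols for _ in range(rows)]
    dist[0][0] = 0.0

    # Forward dynamic programming: from cell (i, j), consume `a` source
    # characters and `b` target characters in a single operation.
    for i in range(rows):
        for j in range(cols):
            base = dist[i][j]
            if base == inf:
                continue
            for a in range(max_n + 1):
                if i + a > len(source):
                    break
                for b in range(max_n + 1):
                    if a == 0 and b == 0:
                        continue
                    if j + b > len(target):
                        break
                    src, tgt = source[i:i + a], target[j:j + b]
                    if (src, tgt) in costs:
                        cost = costs[(src, tgt)]
                    elif src == tgt:
                        cost = 0.0
                    elif a <= 1 and b <= 1:
                        cost = 1.0  # unit insert, delete or substitute
                    else:
                        continue  # no known cost for this n-gram pair
                    if base + cost < dist[i + a][j + b]:
                        dist[i + a][j + b] = base + cost
    return dist[-1][-1]


# Hypothetical Cyrillic-to-Latin costs, written by hand for this example.
example_costs = {
    ("ч", "ch"): 0.0,
    ("е", "e"): 0.0,
    ("х", "kh"): 0.0,
    ("о", "o"): 0.0,
    ("в", "v"): 0.0,
}

with_costs = ngram_edit_distance("чехов", "chekhov", example_costs)
without_costs = ngram_edit_distance("чехов", "chekhov", {})
```

With an empty cost table the function reduces to the ordinary Levenshtein distance, so the two strings above, which share no characters, are as far apart as possible. With the illustrative table, each Cyrillic letter is rewritten into its Latin n-gram at no cost and the two spellings are treated as equivalent.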
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pouliquen, B. (2008). Similarity of Names Across Scripts: Edit Distance Using Learned Costs of N-Grams. In: Nordström, B., Ranta, A. (eds) Advances in Natural Language Processing. GoTAL 2008. Lecture Notes in Computer Science, vol 5221. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85287-2_39
DOI: https://doi.org/10.1007/978-3-540-85287-2_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85286-5
Online ISBN: 978-3-540-85287-2
eBook Packages: Computer Science, Computer Science (R0)