Similarity of Names Across Scripts: Edit Distance Using Learned Costs of N-Grams

  • Conference paper
Advances in Natural Language Processing (GoTAL 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5221)

Abstract

Any cross-language processing application must first tackle the problem of transliteration when it encounters a language written in another script. One solution is to use existing transliteration tools, but these tools are usually not suitable for all purposes, and for some script pairs they do not exist at all. Our aim is to discriminate transliterations across different scripts in a unified way, using a learning method that builds a transliteration model from a set of transliterated proper names. We compare two strings using an algorithm that computes a Levenshtein edit distance from learned n-gram costs. Our evaluations show that the similarity measure is accurate.
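To make the idea of an edit distance over n-gram costs concrete, the following is a minimal sketch and not the paper's actual algorithm. It assumes a hypothetical cost table, here called `EXAMPLE_COSTS`, that maps pairs of character segments (source n-gram, target n-gram) to a cost; in the paper such costs are learned from transliterated name pairs, whereas the values below are invented for illustration. The function name `ngram_edit_distance` and the parameters `max_n` and `default_cost` are likewise assumptions. Segment pairs missing from the table fall back to ordinary unit-cost Levenshtein operations on single characters.

```python
from typing import Dict, Tuple

# Hypothetical cost table (illustrative values, not learned from data):
# maps (source n-gram, target n-gram) to the cost of rewriting one as the other.
EXAMPLE_COSTS: Dict[Tuple[str, str], float] = {
    ("ш", "sch"): 0.1,
    ("м", "m"): 0.0,
    ("и", "i"): 0.0,
    ("д", "d"): 0.0,
    ("т", "t"): 0.0,
}


def ngram_edit_distance(
    source: str,
    target: str,
    costs: Dict[Tuple[str, str], float],
    max_n: int = 3,
    default_cost: float = 1.0,
) -> float:
    """Edit distance where segments of up to max_n characters may be
    rewritten at a cost taken from `costs`; single-character insertions,
    deletions and substitutions fall back to `default_cost`."""
    rows, cols = len(source) + 1, len(target) + 1
    inf = float("inf")
    # dist[i][j] = cheapest way to rewrite source[:i] into target[:j]
    dist = [[inf] * cols for _ in range(rows)]
    dist[0][0] = 0.0
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            best = inf
            # Try consuming the last a source chars and the last b target chars.
            for a in range(min(max_n, i) + 1):
                for b in range(min(max_n, j) + 1):
                    if a == 0 and b == 0:
                        continue
                    src_seg = source[i - a:i]
                    tgt_seg = target[j - b:j]
                    if (src_seg, tgt_seg) in costs:
                        step = costs[(src_seg, tgt_seg)]
                    elif a <= 1 and b <= 1:
                        # Plain Levenshtein step: match, substitute, insert or delete.
                        step = 0.0 if src_seg == tgt_seg else default_cost
                    else:
                        # Multi-character rewrites are allowed only when a cost is known.
                        continue
                    best = min(best, dist[i - a][j - b] + step)
            dist[i][j] = best
    return dist[-1][-1]


if __name__ == "__main__":
    print(ngram_edit_distance("шмидт", "schmidt", EXAMPLE_COSTS))
    print(ngram_edit_distance("kitten", "sitting", {}))
```

With an empty cost table the function reduces to the standard Levenshtein distance, so the table alone determines how closely two names in different scripts are judged to match.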

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pouliquen, B. (2008). Similarity of Names Across Scripts: Edit Distance Using Learned Costs of N-Grams. In: Nordström, B., Ranta, A. (eds) Advances in Natural Language Processing. GoTAL 2008. Lecture Notes in Computer Science, vol 5221. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85287-2_39

  • DOI: https://doi.org/10.1007/978-3-540-85287-2_39

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-85286-5

  • Online ISBN: 978-3-540-85287-2

  • eBook Packages: Computer Science, Computer Science (R0)
