MDL-Based Models for Transliteration Generation

Nouri, Javad; Pivovarova, Lidia; Yangarber, Roman

doi:10.1007/978-3-642-39593-2_18


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7978)

Included in the following conference series:

International Conference on Statistical Language and Speech Processing

Abstract

This paper presents models for automatic transliteration of proper names between languages that use different alphabets. The models extend our work on automatic discovery of patterns of etymological sound change, based on the Minimum Description Length (MDL) Principle. The models for pairwise alignment are extended with prediction algorithms that produce transliterated names. We present results on 13 parallel corpora for 7 languages, including English, Russian, and Farsi, extracted from Wikipedia headlines. The transliteration corpora are released for public use. The models achieve up to 88% word-level accuracy and up to 99% symbol-level F-score. We discuss the results from several perspectives and analyze how corpus size, the language pair, the type of names (persons, locations), and noise in the data affect performance.

Author information

Authors and Affiliations

Department of Computer Science, University of Helsinki, Finland
Javad Nouri, Lidia Pivovarova & Roman Yangarber
St. Petersburg State University, Russia
Lidia Pivovarova

Editor information

Editors and Affiliations

Research Group on Mathematical Linguistics, Universitat Rovira i Virgili, Avinguda Catalunya, 35, 43002, Tarragona, Spain
Adrian-Horia Dediu & Carlos Martín-Vide
Research Institute for Information and Language Processing, Research Group in Computational Linguistics, University of Wolverhampton, WV1 1SB, Wolverhampton, UK
Ruslan Mitkov
Fakultät für Informatik, Institut für Wissens- und Sprachverarbeitung, Otto-von-Guericke-Universität Magdeburg, Universitätsplatz 2, 39106, Magdeburg, Germany
Bianca Truthe

About this paper

Cite this paper

Nouri, J., Pivovarova, L., Yangarber, R. (2013). MDL-Based Models for Transliteration Generation. In: Dediu, AH., Martín-Vide, C., Mitkov, R., Truthe, B. (eds) Statistical Language and Speech Processing. SLSP 2013. Lecture Notes in Computer Science, vol 7978. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39593-2_18

DOI: https://doi.org/10.1007/978-3-642-39593-2_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39592-5
Online ISBN: 978-3-642-39593-2
eBook Packages: Computer Science; Computer Science (R0)