Combining Evidence in Cognate Identification

Kondrak, Grzegorz

doi:10.1007/978-3-540-24840-8_4

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3060)

Included in the following conference series:

Conference of the Canadian Society for Computational Studies of Intelligence

Abstract

Cognates are words of the same origin that belong to distinct languages. The problem of automatically identifying cognates arises in language reconstruction and in bitext-related tasks. Evidence of cognation may come from several information sources, such as phonetic similarity, semantic similarity, and recurrent sound correspondences. I discuss ways of defining measures for each type of similarity and propose a method of combining them into an integrated cognate identification program. The new method requires no manual parameter tuning and performs well when tested on Indoeuropean and Algonquian lexical data.

Author information

Authors and Affiliations

Department of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada
Grzegorz Kondrak

Editor information

Editors and Affiliations

University of Windsor, 401 Sunset Avenue, N9B 3P4, Windsor, Ontario, Canada
Ahmed Y. Tawfik
School of Computer Science, University of Windsor, Windsor, Ontario,
Scott D. Goodwin

About this paper

Cite this paper

Kondrak, G. (2004). Combining Evidence in Cognate Identification. In: Tawfik, A.Y., Goodwin, S.D. (eds) Advances in Artificial Intelligence. Canadian AI 2004. Lecture Notes in Computer Science, vol 3060. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24840-8_4

DOI: https://doi.org/10.1007/978-3-540-24840-8_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22004-6
Online ISBN: 978-3-540-24840-8
eBook Packages: Springer Book Archive
