Analysis of EU Languages Through Text Compression

Kettunen, Kimmo; Sadeniemi, Markus; Lindh-Knuutila, Tiina; Honkela, Timo

doi:10.1007/11816508_12

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4139)

Abstract

In this article, we study the differences between European languages using statistical and unsupervised methods. The analysis is conducted at different levels of language: lexical, morphological and syntactic. Our premise is that the difficulty of translation can be perceived as differences or similarities at these levels. The analyses are based on the concept of Kolmogorov complexity, which is used to compare language structure at the syntactic and morphological levels. The way languages convey information at these levels is taken as a measure of similarity or dissimilarity between them, and the results are compared to classical linguistic groupings. The results will serve as a tool in developing machine translation systems, e.g., in the following way: if the source language conveys more information at the morphological level and the target language more at the syntactic level, the (machine) translator must be able to transfer the information from one level to another.
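
Kolmogorov complexity is not computable, so compression-based studies typically approximate it by the size of a text after compression. The abstract does not say which compressor or distance the authors used; the following is only a minimal sketch, assuming zlib as a stand-in compressor and the normalized compression distance (NCD) as the similarity measure. The function names and the two sample sentences are hypothetical illustrations, not the paper's data or code.

```python
import zlib

# Hypothetical toy samples for illustration only; a real study would use
# large parallel corpora, not two short sentences.
ENGLISH = (
    "In this article we study the differences between the European "
    "languages using statistical and unsupervised methods, and we compare "
    "the results with the groupings proposed in classical linguistics."
)
FINNISH = (
    "Tässä artikkelissa tutkimme Euroopan kielten välisiä eroja "
    "tilastollisilla ja ohjaamattomilla menetelmillä ja vertaamme tuloksia "
    "perinteisen kielitieteen esittämiin ryhmittelyihin."
)


def compressed_size(text: str) -> int:
    """Bytes needed for the zlib-compressed UTF-8 encoding of text.

    This is a crude, computable upper-bound proxy for Kolmogorov
    complexity (assumption: zlib at maximum compression level).
    """
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    Values near 0 suggest the texts share much structure; values near 1
    suggest they share little, as far as this compressor can tell.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    print("size(en):", compressed_size(ENGLISH))
    print("size(fi):", compressed_size(FINNISH))
    print("ncd(en, en):", round(ncd(ENGLISH, ENGLISH), 3))
    print("ncd(en, fi):", round(ncd(ENGLISH, FINNISH), 3))
```

With texts this short the compressor's fixed overhead distorts the numbers, so the sketch only shows the mechanics: a text compared with itself should score lower than a text compared with one in another language.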

Author information

Authors and Affiliations

Department of Information Studies, University of Tampere, Kanslerinrinne 1, FIN-33014, Finland
Kimmo Kettunen
Laboratory of Computer and Information Science, Helsinki University of Technology, P.O. Box 5400, FI-02015 HUT, Finland
Markus Sadeniemi, Tiina Lindh-Knuutila & Timo Honkela

Editor information

Editors and Affiliations

Turku Centre for Computer Science (TUCS), Department of Information Technology, University of Turku, Joukahaisenkatu 3-5 B, FIN-20520, Turku, Finland
Tapio Salakoski
Turku Centre for Computer Science (TUCS) and Department of IT, University of Turku, Lemminkäisenkatu 14 A, 20520, Turku, Finland
Filip Ginter & Sampo Pyysalo
Department of Information Technology, University of Turku, Lemminkäisenkatu 14–18 A, FIN-20520, Turku, Finland
Tapio Pahikkala

About this paper

Cite this paper

Kettunen, K., Sadeniemi, M., Lindh-Knuutila, T., Honkela, T. (2006). Analysis of EU Languages Through Text Compression. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_12

DOI: https://doi.org/10.1007/11816508_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37334-6
Online ISBN: 978-3-540-37336-0
eBook Packages: Computer Science, Computer Science (R0)
