
Analysis of EU Languages Through Text Compression

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4139)

Abstract

In this article, we study the differences between European languages using statistical and unsupervised methods. The analysis is conducted at several levels of language: lexical, morphological, and syntactic. Our premise is that the difficulty of translation can be understood in terms of differences and similarities between languages at these levels. The analyses are based on the concept of Kolmogorov complexity, which is used to compare language structure at the syntactic and morphological levels. The way languages convey information at these levels is taken as a measure of their similarity or dissimilarity, and the results are compared to classical linguistic classification. The results can serve as a tool in developing machine translation systems. For example, if the source language conveys more information at the morphological level and the target language more at the syntactic level, the (machine) translator must be able to transfer information from one level to the other.
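Kolmogorov complexity is not computable, so compression-based studies usually approximate it by the output size of a real compressor. The following is a minimal sketch of that idea, not the authors' procedure: the abstract does not say which compressor or distance measure was used, so the choice of `zlib`, the normalized compression distance formula, and the function names `compressed_size` and `ncd` are all assumptions made for illustration.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 text.

    Used here as a computable stand-in for Kolmogorov complexity
    (an assumption; any general-purpose compressor could be substituted).
    """
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    Values near 0 suggest the texts share most of their structure;
    values near 1 suggest they share little.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Toy inputs only; a real comparison would need large parallel corpora.
    english = "The committee adopted the proposal after a long debate. " * 40
    finnish = "Valiokunta hyväksyi ehdotuksen pitkän keskustelun jälkeen. " * 40
    print("English vs English:", ncd(english, english))
    print("English vs Finnish:", ncd(english, finnish))
```

Computing such a distance for every pair of languages over the same parallel text would give a distance matrix that could then be clustered and compared against a linguistic classification. How the morphological and syntactic levels are separated before compression is not described in the abstract and is not attempted here.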




Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kettunen, K., Sadeniemi, M., Lindh-Knuutila, T., Honkela, T. (2006). Analysis of EU Languages Through Text Compression. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_12


  • DOI: https://doi.org/10.1007/11816508_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-37334-6

  • Online ISBN: 978-3-540-37336-0

  • eBook Packages: Computer Science, Computer Science (R0)
