Towards Creation of a Lithuanian Lemmatizer for Open Online Collaborative Machine Translation

  • Conference paper
Information and Software Technologies (ICIST 2017)

Abstract

This paper describes a baseline dictionary-based Lithuanian lemmatizer designed for an open online collaborative Machine Translation system. We evaluated the tool on a gold-standard corpus covering four domains (official documents, fiction texts, scientific texts, and periodicals) and containing ~1 million running words in total, and obtained an encouraging accuracy of ~85.7%. We then carried out an error analysis, which will guide further improvements to the lemmatizer.
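As a rough illustration of what a dictionary-based lemmatizer and its token-level accuracy measure can look like, here is a minimal Python sketch. It is not the authors' implementation: the toy lexicon, the function names, and the fallback of returning the lowercased word form for unknown words are all assumptions made for this example.

```python
from typing import Dict, Iterable, Tuple

# Toy lexicon (hypothetical, for illustration only): word form -> lemma.
# A real lexicon would hold far more entries and would have to deal with
# ambiguous forms, which this sketch ignores.
TOY_LEXICON: Dict[str, str] = {
    "namas": "namas",
    "namo": "namas",
    "namai": "namas",
    "namų": "namas",
    "eina": "eiti",
    "ėjo": "eiti",
}


def lemmatize(token: str, lexicon: Dict[str, str]) -> str:
    """Look the lowercased form up in the lexicon.

    Assumption: unknown forms fall back to the lowercased form itself.
    """
    form = token.lower()
    return lexicon.get(form, form)


def accuracy(gold: Iterable[Tuple[str, str]], lexicon: Dict[str, str]) -> float:
    """Share of (word form, gold lemma) pairs whose predicted lemma matches."""
    pairs = list(gold)
    if not pairs:
        return 0.0
    correct = sum(1 for form, lemma in pairs if lemmatize(form, lexicon) == lemma)
    return correct / len(pairs)


if __name__ == "__main__":
    # Hypothetical gold sample; "miesto" is not in the toy lexicon.
    sample = [
        ("namo", "namas"),
        ("ėjo", "eiti"),
        ("miestas", "miestas"),
        ("miesto", "miestas"),
    ]
    print(accuracy(sample, TOY_LEXICON))
```

In this sketch, out-of-lexicon inflected forms such as "miesto" are counted as errors, which is one of the failure types an error analysis of a purely dictionary-based approach would be expected to surface.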

Notes

  1. Available on https://translate.tilde.com.

  2. Available on http://vertimas.vdu.lt.

  3. Available at: http://tekstynas.vdu.lt/page.xhtml;jsessionid=13737278C03300B67C9E5CF2C2AB2734?id=morphological-annotator.

  4. Available at http://www.semantika.lt/TextAnnotation/Annotation/Annotate.

  5. Available at http://www.undlfoundation.org/undlfoundation.

  6. Available at http://www.hutchinsweb.me.uk/Routledge-2014.pdf and http://www.hutchinsweb.me.uk/IntroMT-13.pdf.

  7. Note that the monolingual nature of the analyses and of the generations makes them directly reusable for building new language pairs.

  8. ATEF also allows the analysis of multiword expressions. This will not be described here. It can also be used in conjunction with other steps (as in Guilbaud’s German analyzer) to produce disambiguated analyses.

  9. The database and the analyzer (LithuanianMorphoAnalyser.zip) can be downloaded from lingwarium.org/heloise/index.php?Ref=&ws=AnaLTN&lgPair=LTN-ENG.

References

  1. Bejček, E., Hajičová, E., Hajič, J., Jínová, P., Kettnerová, V., Kolářová, V., Mikulová, M., Mírovský, J., Nedoluzhko, A., Panevová, J., Poláková, L., Ševčíková, M., Štěpánek, J., Zikánová, Š.: Prague Dependency Treebank 3.0 (2013)

  2. Berment, V., Boitet, C.: Heloise – a reengineering of Ariane-G5 SLLPs for application to p-languages. In: Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Posters, pp. 113–124 (2012)

  3. Berment, V.: Some thoughts on how to address commercially unprofitable languages and language pairs. In: the 5th Workshop on South and Southeast Asian NLP (WSSANLP) (2014). Keynote speech http://www.sanlp.org/wssanlp2014/KeyNoteSpeach.pdf

  4. Bojar, O.: English-to-Czech factored machine translation. In: Proceedings of the Second Workshop on Statistical Machine Translation (StatMT 2007), pp. 232–239 (2007)

  5. Costa-jussá, M.R., Fonollosa, J.A.R.: Latest trends in hybrid machine translation and its applications. Comput. Speech Lang. 32(1), 3–10 (2015)

  6. Daudaravičius, V., Rimkutė, E., Utka, A.: Morphological annotation of the Lithuanian corpus. In: Proceedings of the Workshop on Balto-Slavonic Natural Language Processing, Information Extraction and Enabling Technologies of ACL 2007, pp. 94–99 (2007)

  7. Goldwater, S., McClosky, D.: Improving statistical MT through morphological analysis. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT 2005), pp. 676–683 (2005)

  8. Guilbaud, J.-P., Boitet, C., Berment, V.: Un analyseur morphologique étendu de l’allemand traitant les formes verbales à particule séparée [An extended morphological analyzer of German handling verb forms with separated particles] (in French). In: Traitement Automatique des Langues Naturelles – Rencontres des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (TALN-RÉCITAL), pp. 755–763 (2013)

  9. Ingason, A.K., Helgadóttir, S., Loftsson, H., Rögnvaldsson, E.: A mixed method lemmatization algorithm using a hierarchy of linguistic identities (HOLI). In: Nordström, B., Ranta, A. (eds.) GoTAL 2008. LNCS, vol. 5221, pp. 205–216. Springer, Heidelberg (2008). doi:10.1007/978-3-540-85287-2_20

  10. Jongejan, B., Dalianis, H.: Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 145–153 (2009)

  11. Kanis, J., Skorkovská, L.: Comparison of different lemmatization approaches through the means of information retrieval performance. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2010. LNCS, vol. 6231, pp. 93–100. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15760-8_13

  12. Marcinkevičienė, R.: Tekstynų lingvistika: teorija ir praktika [Corpus Linguistics: Theory and Practice] (in Lithuanian). Darbai ir dienos 24, 7–64 (2000)

  13. Nakov, P., Ng, H.T.: Translating from morphologically complex languages: a paraphrase-based approach. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT 2011), pp. 1298–1307 (2011)

  14. Schlinger, E., Chahuneau, V., Dyer, C.: morphogen: Translation into morphologically rich languages with synthetic phrases. Prague Bull. Math. Linguist. 100, 51–62 (2013)

  15. Skadiņš, R., Goba, K., Šics, V.: Improving SMT for Baltic languages with factored models. In: Proceedings of the 4th International Conference Human Language Technologies – The Baltic Perspective, pp. 125–132 (2010)

  16. Tran, K.M., Bisazza, A., Monz, C.: Word translation prediction for morphologically rich languages with bilingual neural networks. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1676–1688 (2014)

  17. Zinkevičius, V.: Lemuoklis – morfologinei analizei [Morphological Analysis with Lemuoklis] (in Lithuanian). Darbai ir dienos 24, 246–273 (2000)

Acknowledgments

We would like to express our gratitude to Albert Sy Den for his patient contribution to the transfer of lexical data into database form.

Author information

Corresponding author

Correspondence to Jurgita Kapočiūtė-Dzikienė.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Kapočiūtė-Dzikienė, J., Berment, V., Rimkutė, E. (2017). Towards Creation of a Lithuanian Lemmatizer for Open Online Collaborative Machine Translation. In: Damaševičius, R., Mikašytė, V. (eds) Information and Software Technologies. ICIST 2017. Communications in Computer and Information Science, vol 756. Springer, Cham. https://doi.org/10.1007/978-3-319-67642-5_43

  • DOI: https://doi.org/10.1007/978-3-319-67642-5_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67641-8

  • Online ISBN: 978-3-319-67642-5

  • eBook Packages: Computer Science, Computer Science (R0)
