Abstract
This paper describes a baseline dictionary-based Lithuanian lemmatizer designed for an open online collaborative machine translation system. We evaluated the tool on a gold-standard corpus covering four domains (official documents, fiction, scientific texts, and periodicals) and containing ~1 million running words in total, and obtained an encouraging accuracy of ~85.7%. We then carried out an error analysis, which will guide further improvements to the lemmatizer.
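As a rough illustration of what dictionary-based lemmatization and token-level accuracy mean in this setting, the following minimal Python sketch may help. It is not the authors' implementation: the toy lexicon, the fall-back rule for unknown words (return the lower-cased form unchanged), and the names `LEXICON`, `lemmatize`, `accuracy`, and `GOLD` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon mapping inflected forms to lemmas.
# It stands in for a real lexical database and is NOT the paper's data.
LEXICON: Dict[str, str] = {
    "namas": "namas",  # 'house', nominative singular
    "namo": "namas",   # genitive singular
    "namai": "namas",  # nominative plural
    "namų": "namas",   # genitive plural
    "eina": "eiti",    # 'to go', 3rd person present
    "ėjo": "eiti",     # 3rd person past
}


def lemmatize(token: str, lexicon: Dict[str, str] = LEXICON) -> str:
    """Look up the lower-cased token; unknown words fall back to the form itself.

    The fall-back is an assumption of this sketch, not a documented rule.
    """
    key = token.lower()
    return lexicon.get(key, key)


def accuracy(gold: List[Tuple[str, str]],
             lexicon: Dict[str, str] = LEXICON) -> float:
    """Share of (form, gold lemma) pairs whose predicted lemma matches the gold one."""
    if not gold:
        return 0.0
    correct = sum(1 for form, lemma in gold if lemmatize(form, lexicon) == lemma)
    return correct / len(gold)


# Hypothetical gold-standard sample: (word form, expected lemma).
GOLD: List[Tuple[str, str]] = [
    ("Namai", "namas"),
    ("namų", "namas"),
    ("ėjo", "eiti"),
    ("miestas", "miestas"),  # out of lexicon, but the form equals its lemma
    ("miesto", "miestas"),   # out of lexicon, so the fall-back gets it wrong
]

if __name__ == "__main__":
    print(f"accuracy = {accuracy(GOLD):.2f}")
```

In this toy sample the only error is an out-of-lexicon inflected form, which hints at why an error analysis of dictionary coverage is a natural next step for such a baseline.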
Notes
- 1. Available on https://translate.tilde.com.
- 2. Available on http://vertimas.vdu.lt.
- 3.
- 4.
- 5. Available at http://www.undlfoundation.org/undlfoundation.
- 6.
- 7. Note that the monolingual nature of the analyses and of the generations makes them directly reusable for building new language pairs.
- 8. ATEF also allows the analysis of multiword expressions. This will not be described here. It can also be used in conjunction with other steps (as in Guilbaud’s German analyzer) to produce disambiguated analyses.
- 9. The database and the analyzer (LithuanianMorphoAnalyser.zip) can be downloaded from lingwarium.org/heloise/index.php?Ref=&ws=AnaLTN&lgPair=LTN-ENG.
References
Bejček, E., Hajičová, E., Hajič, J., Jínová, P., Kettnerová, V., Kolářová, V., Mikulová, M., Mírovský, J., Nedoluzhko, A., Panevová, J., Poláková, L., Ševčíková, M., Štěpánek, J., Zikánová, Š.: Prague Dependency Treebank 3.0 (2013)
Berment, V., Boitet, C.: Heloise – a reengineering of Ariane-G5 SLLPs for application to p-languages. In: Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Posters, pp. 113–124 (2012)
Berment, V.: Some thoughts on how to address commercially unprofitable languages and language pairs. In: The 5th Workshop on South and Southeast Asian NLP (WSSANLP) (2014). Keynote speech, http://www.sanlp.org/wssanlp2014/KeyNoteSpeach.pdf
Bojar, O.: English-to-Czech factored machine translation. In: Proceedings of the Second Workshop on Statistical Machine Translation (StatMT 2007), pp. 232–239 (2007)
Costa-jussá, M.R., Fonollosa, J.A.R.: Latest trends in hybrid machine translation and its applications. Comput. Speech Lang. 32(1), 3–10 (2015)
Daudaravičius, V., Rimkutė, E., Utka, A.: Morphological annotation of the Lithuanian corpus. In: Proceedings of the Workshop on Balto-Slavonic Natural Language Processing, Information Extraction and Enabling Technologies of ACL 2007, pp. 94–99 (2007)
Goldwater, S., McClosky, D.: Improving statistical MT through morphological analysis. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT 2005), pp. 676–683 (2005)
Guilbaud, J.-P., Boitet, C., Berment, V.: Un analyseur morphologique étendu de l’allemand traitant les formes verbales à particule séparée [An extended morphological analyzer of German handling verbal forms with separated separable particles] (in French). In: Traitement Automatique des Langues Naturelles – Rencontres des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (TALN-RÉCITAL), pp. 755–763 (2013)
Ingason, A.K., Helgadóttir, S., Loftsson, H., Rögnvaldsson, E.: A mixed method lemmatization algorithm using a hierarchy of linguistic identities (HOLI). In: Nordström, B., Ranta, A. (eds.) GoTAL 2008. LNCS, vol. 5221, pp. 205–216. Springer, Heidelberg (2008). doi:10.1007/978-3-540-85287-2_20
Jongejan, B., Dalianis, H.: Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 145–153 (2009)
Kanis, J., Skorkovská, L.: Comparison of different lemmatization approaches through the means of information retrieval performance. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2010. LNCS, vol. 6231, pp. 93–100. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15760-8_13
Marcinkevičienė, R.: Tekstynų lingvistika: teorija ir praktika [Corpus Linguistics: Theory and Practice] (in Lithuanian). Darbai ir dienos 24, 7–64 (2000)
Nakov, P., Ng, H.T.: Translating from morphologically complex languages: a paraphrase-based approach. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT 2011), pp. 1298–1307 (2011)
Schlinger, E., Chahuneau, V., Dyer, C.: morphogen: Translation into morphologically rich languages with synthetic phrases. Prague Bull. Math. Linguist. 100, 51–62 (2013)
Skadiņš, R., Goba, K., Šics, V.: Improving SMT for Baltic languages with factored models. In: Proceedings of the 4th International Conference Human Language Technologies – The Baltic Perspective, pp. 125–132 (2010)
Tran, K.M., Bisazza, A., Monz, C.: Word translation prediction for morphologically rich languages with bilingual neural networks. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1676–1688 (2014)
Zinkevičius, V.: Lemuoklis – morfologinei analizei [Morphological Analysis with Lemuoklis] (in Lithuanian). Darbai ir dienos 24, 246–273 (2000)
Acknowledgments
We would like to express our gratitude to Albert Sy Den for his patient contribution to the transfer of lexical data into database form.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Kapočiūtė-Dzikienė, J., Berment, V., Rimkutė, E. (2017). Towards Creation of a Lithuanian Lemmatizer for Open Online Collaborative Machine Translation. In: Damaševičius, R., Mikašytė, V. (eds) Information and Software Technologies. ICIST 2017. Communications in Computer and Information Science, vol 756. Springer, Cham. https://doi.org/10.1007/978-3-319-67642-5_43
DOI: https://doi.org/10.1007/978-3-319-67642-5_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67641-8
Online ISBN: 978-3-319-67642-5
eBook Packages: Computer Science, Computer Science (R0)