FinnPos: an open-source morphological tagging and lemmatization toolkit for Finnish

  • Project Notes
Language Resources and Evaluation

Abstract

This paper describes FinnPos, an open-source morphological tagging and lemmatization toolkit for Finnish. The morphological tagging model is based on the averaged structured perceptron classifier. Given training data, new taggers are estimated efficiently using a combination of beam search and a model cascade. Lemmatization combines a rule-based morphological analyzer, OMorFi, with a data-driven lemmatization model. The toolkit can be applied directly to tagging and lemmatization of running text, using models learned from the recently published Finnish Turku Dependency Treebank and FinnTreeBank. Empirical evaluation on these corpora shows that FinnPos performs favorably compared to reference systems in tagging and lemmatization accuracy. We also demonstrate that the system is highly competitive in the computational efficiency of both learning new models and assigning analyses to novel sentences.
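To make the tagging approach concrete, the following is a minimal sketch of an averaged structured perceptron tagger decoded with beam search. It is an illustration under stated assumptions, not FinnPos's implementation: the class name, the three feature templates, the beam width, and the toy data are all hypothetical, and the model cascade and the lemmatizer are omitted.

```python
from collections import defaultdict

def features(words, i, prev_tag, tag):
    """Hypothetical feature templates: word form, 3-char suffix, tag bigram."""
    w = words[i]
    return [f"w={w}|t={tag}", f"suf={w[-3:]}|t={tag}", f"pt={prev_tag}|t={tag}"]

class AveragedPerceptronTagger:
    def __init__(self, tags, beam_width=3):
        self.tags = sorted(tags)
        self.beam_width = beam_width
        self.w = defaultdict(float)       # current weights
        self.totals = defaultdict(float)  # weight sums used for averaging
        self.stamps = defaultdict(int)    # step at which each weight last changed
        self.step = 0

    def score(self, feats):
        return sum(self.w.get(f, 0.0) for f in feats)

    def decode(self, words):
        """Beam search: keep only the best `beam_width` partial tag sequences."""
        beam = [(0.0, [])]
        for i in range(len(words)):
            candidates = []
            for s, seq in beam:
                prev = seq[-1] if seq else "<s>"
                for t in self.tags:
                    gain = self.score(features(words, i, prev, t))
                    candidates.append((s + gain, seq + [t]))
            candidates.sort(key=lambda c: -c[0])  # stable sort, ties keep order
            beam = candidates[: self.beam_width]
        return beam[0][1]

    def _update(self, feat, delta):
        # Lazy averaging: credit the old value for the steps it was in effect.
        self.totals[feat] += (self.step - self.stamps[feat]) * self.w[feat]
        self.stamps[feat] = self.step
        self.w[feat] += delta

    def train(self, data, epochs=5):
        for _ in range(epochs):
            for words, gold in data:
                self.step += 1
                pred = self.decode(words)
                if pred == gold:
                    continue
                # Standard perceptron update: reward gold, penalize prediction.
                for i in range(len(words)):
                    gold_prev = gold[i - 1] if i else "<s>"
                    pred_prev = pred[i - 1] if i else "<s>"
                    for f in features(words, i, gold_prev, gold[i]):
                        self._update(f, 1.0)
                    for f in features(words, i, pred_prev, pred[i]):
                        self._update(f, -1.0)
        if self.step:
            # Replace each weight with its average over all training steps.
            for f in list(self.w):
                total = self.totals[f] + (self.step - self.stamps[f]) * self.w[f]
                self.w[f] = total / self.step

# Toy training data (hypothetical, not drawn from either treebank).
data = [
    (["koira", "juoksee"], ["NOUN", "VERB"]),
    (["kissa", "nukkuu"], ["NOUN", "VERB"]),
]
tagger = AveragedPerceptronTagger({"NOUN", "VERB"}, beam_width=2)
tagger.train(data)
print(tagger.decode(["kissa", "juoksee"]))
```

Averaging the weights over all training steps, rather than keeping the final weights, is the usual way to reduce the perceptron's sensitivity to the last few updates. The beam bounds decoding cost per token by the beam width times the number of tags, at the price of inexact search.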

Notes

  1. https://github.com/mpsilfve/FinnPos

  2. In addition to applying standard beam search and parameter updates, we experimented with the maximum violation and early updates of Huang et al. (2012) but obtained no improvements in model accuracy.

  3. See documentation at https://github.com/mpsilfve/FinnPos/wiki.

  4. Available at https://sites.google.com/site/morfetteweb/.

  5. Available at https://code.google.com/p/cistern/wiki/marmot.

  6. Available at http://code.google.com/p/hunpos/.

  7. The example is taken from FinnTreeBank.

References

  • Bohnet, B., Nivre, J., Boguslavsky, I., Farkas, R., Ginter, F., & Hajič, J. (2013). Joint morphological and syntactic analysis for richly inflected languages. Transactions of the Association for Computational Linguistics, 1, 415–428.

  • Brants, T. (2000). TnT: A statistical part-of-speech tagger. In Proceedings of the 6th conference on applied natural language processing (ANLP 2000) (pp. 224–231). Seattle, Washington, USA.

  • Charniak, E., & Johnson, M. (2005). Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proceedings of the 43rd annual meeting on association for computational linguistics (ACL 2005) (pp. 173–180). Ann Arbor, Michigan, USA.

  • Chrupala, G., Dinu, G., & van Genabith, J. (2008). Learning morphology with Morfette. In Proceedings of the 6th international conference on language resources and evaluation (LREC 2008) (pp. 2362–2367). Marrakech, Morocco.

  • Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 conference on empirical methods in natural language processing (EMNLP 2002) (Vol. 10, pp. 1–8). Philadelphia, Pennsylvania, USA.

  • Freund, Y., & Schapire, R. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 277–296.

  • Hakulinen, A., Korhonen, R., Vilkuna, M., & Koivisto, V. (2004). Iso suomen kielioppi. Suomalaisen kirjallisuuden seura, http://scripta.kotus.fi/visk.

  • Halácsy, P., Kornai, A., & Oravecz, C. (2007). HunPos: An open source trigram tagger. In Proceedings of the 45th annual meeting of the association of computational linguistics (ACL 2007) (pp. 209–212). Prague, Czech Republic.

  • Haverinen, K., Ginter, F., Laippala, V., Viljanen, T., & Salakoski, T. (2009). Dependency annotation of Wikipedia: First steps towards a Finnish treebank. In The 8th international workshop on treebanks and linguistic theories (TLT 2009) (pp. 95–105). Milan, Italy.

  • Haverinen, K., Nyblom, J., Viljanen, T., Laippala, V., Kohonen, S., Missilä, A., Ojala, S., Salakoski, T., & Ginter, F. (2014). Building the essential resources for Finnish: The Turku Dependency Treebank. Language Resources and Evaluation, 48(3), 493–531.

  • Huang, L., Fayong, S., & Guo, Y. (2012). Structured perceptron with inexact search. In Proceedings of the 2012 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies (NAACL HLT 2012) (pp. 142–151). Montreal, Canada.

  • Karlsson, F. (1990). Constraint grammar as a framework for parsing running text. In Proceedings of the 13th conference on computational linguistics (COLING 1990) (pp. 168–173). Helsinki, Finland.

  • Lindén, K., Axelson, E., Hardwick, S., Pirinen, T., & Silfverberg, M. (2011). HFST—Framework for compiling and applying morphologies. In Systems and frameworks for computational morphology (SFCM 2011) (pp. 67–85). Zurich, Switzerland.

  • Müller, T., Schmid, H., & Schütze, H. (2013). Efficient higher-order CRFs for morphological tagging. In Proceedings of 2013 empirical methods in natural language processing (EMNLP 2013) (pp. 322–332). Seattle, Washington, USA.

  • Pal, C., Sutton, C., & McCallum, A. (2006). Sparse forward-backward using minimum divergence beams for fast training of conditional random fields. In International conference on acoustics, speech and signal processing (ICASSP 2006) (Vol. 5, pp. 581–584). Toulouse, France.

  • Pirinen, T. (2008). Automatic finite state morphological analysis of Finnish language using open source resources (in Finnish). Master’s thesis, University of Helsinki.

  • Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Proceedings of the 1996 conference on empirical methods in natural language processing (EMNLP 1996) (Vol. 1, pp. 133–142). New Brunswick, New Jersey, USA.

  • Rush, A. M., & Petrov, S. (2012). Vine pruning for efficient multi-pass dependency parsing. In Proceedings of the 2012 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies (NAACL HLT 2012) (pp. 498–507). Montreal, Canada.

  • Silfverberg, M., & Lindén, K. (2011). Combining statistical models for POS tagging using finite-state calculus. In Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011) (pp. 183–190). Riga, Latvia.

  • Silfverberg, M., Ruokolainen, T., Lindén, K., & Kurimo, M. (2014). Part-of-speech tagging using conditional random fields: Exploiting sub-label dependencies for improved accuracy. In Proceedings of the 52nd annual meeting of the Association for Computational Linguistics (ACL 2014) (pp. 259–264). Baltimore, Maryland, USA.

  • Sutton, C., & McCallum, A. (2011). An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4), 267–373.

  • Voutilainen, A. (2011). FinnTreeBank: Creating a research resource and service for language researchers with Constraint Grammar. In Proceedings of the NODALIDA 2011 workshop constraint grammar applications (pp. 41–49). Riga, Latvia.

  • Weiss, D., & Taskar, B. (2010). Structured prediction cascades. In International conference on artificial intelligence and statistics (AISTATS 2010) (pp. 916–923). Sardinia, Italy.

Author information

Corresponding author

Correspondence to Miikka Silfverberg.

About this article

Cite this article

Silfverberg, M., Ruokolainen, T., Lindén, K. et al. FinnPos: an open-source morphological tagging and lemmatization toolkit for Finnish. Lang Resources & Evaluation 50, 863–878 (2016). https://doi.org/10.1007/s10579-015-9326-3

  • DOI: https://doi.org/10.1007/s10579-015-9326-3
