Building a Pos Tagger and Lemmatizer for the Italian Language

  • Conference paper
Advanced Information Networking and Applications (AINA 2021)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 227)

Abstract

In this work, we present two modules for an open-source Python library for the analysis of the Italian language: a PoS tagger based on the Averaged Perceptron Tagger, and a Lemmatizer based on the vast collection of linguistic data held by the Department of Politics and Communication Science of the University of Salerno. While the Averaged Perceptron Tagger algorithm is mostly applied to English by well-known Python libraries such as NLTK or spaCy, the Lemmatizer is an entirely original module that relies on a large electronic dictionary annotated with syntactic, morphological, and semantic tags. We describe our approach and a preliminary experiment in which we compare the results of our modules with those of another widely used PoS tagger and lemmatizer, TreeTagger.
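
As an illustration of the two ideas named in the abstract, the following is a minimal sketch of an averaged perceptron tagger paired with a dictionary-lookup lemmatizer. It is not the authors' implementation: the class and function names, the feature set, the three toy training sentences, and the tiny LEXICON (a stand-in for the electronic dictionary) are all assumptions made for this example.

```python
from collections import defaultdict


class AveragedPerceptron:
    """Multiclass perceptron whose final weights are averaged over training steps."""

    def __init__(self):
        self.weights = defaultdict(lambda: defaultdict(float))  # feature -> tag -> weight
        self.totals = defaultdict(float)  # (feature, tag) -> accumulated weight
        self.stamps = defaultdict(int)    # (feature, tag) -> step of last change
        self.classes = set()
        self.step = 0

    def predict(self, feats):
        scores = defaultdict(float)
        for feat in feats:
            for tag, weight in self.weights.get(feat, {}).items():
                scores[tag] += weight
        # Iterating over sorted tags makes ties resolve alphabetically.
        return max(sorted(self.classes), key=lambda t: scores[t])

    def update(self, truth, guess, feats):
        self.step += 1
        if truth == guess:
            return
        for feat in feats:
            for tag, delta in ((truth, 1.0), (guess, -1.0)):
                key = (feat, tag)
                # Credit the old weight for the steps during which it was unchanged.
                self.totals[key] += (self.step - self.stamps[key]) * self.weights[feat][tag]
                self.stamps[key] = self.step
                self.weights[feat][tag] += delta

    def average(self):
        # Replace each weight by its mean value over all training steps.
        for feat, tag_weights in self.weights.items():
            for tag, weight in list(tag_weights.items()):
                key = (feat, tag)
                total = self.totals[key] + (self.step - self.stamps[key]) * weight
                tag_weights[tag] = total / self.step


def features(words, i, prev_tag):
    """Hypothetical feature template: word form, suffix, and left context."""
    word = words[i].lower()
    prev_word = words[i - 1].lower() if i > 0 else "<s>"
    return ["bias", "word=" + word, "suffix=" + word[-3:],
            "prev_tag=" + prev_tag, "prev_word=" + prev_word]


def train(sentences, epochs=5):
    model = AveragedPerceptron()
    for sent in sentences:
        model.classes.update(tag for _, tag in sent)
    for _ in range(epochs):
        for sent in sentences:
            words = [w for w, _ in sent]
            prev = "<s>"
            for i, (_, truth) in enumerate(sent):
                feats = features(words, i, prev)
                guess = model.predict(feats)
                model.update(truth, guess, feats)
                prev = guess  # train on predicted history, as at tagging time
    model.average()
    return model


def tag(model, words):
    prev, tags = "<s>", []
    for i in range(len(words)):
        prev = model.predict(features(words, i, prev))
        tags.append(prev)
    return tags


# Toy stand-in for an electronic dictionary: (inflected form, PoS) -> lemma.
LEXICON = {("gatti", "NOUN"): "gatto", ("mangia", "VERB"): "mangiare",
           ("dormono", "VERB"): "dormire", ("legge", "VERB"): "leggere"}


def lemmatize(word, pos):
    """Look the tagged form up in the lexicon; fall back to the lowercased form."""
    return LEXICON.get((word.lower(), pos), word.lower())


# Invented training data, far too small for a real tagger.
TRAIN = [[("il", "DET"), ("gatto", "NOUN"), ("mangia", "VERB")],
         [("i", "DET"), ("gatti", "NOUN"), ("dormono", "VERB")],
         [("la", "DET"), ("ragazza", "NOUN"), ("legge", "VERB")]]
```

A typical use would be to call `model = train(TRAIN)`, tag a tokenized sentence with `tag(model, words)`, and pass each (word, tag) pair to `lemmatize`. Keying the lexicon on the tag as well as the form is what lets a lookup lemmatizer separate homographs, which is why the tagger runs first in this sketch.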

A. Maisto edited Sects. 1, 2, 3, 4, 5; W. Balzano collaborated in the project.

Notes

  1. http://www.nltk.org/.
  2. https://spacy.io/.
  3. https://github.com/sloria/textblob-aptagger.
  4. CONLL.
  5. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.

Author information

Corresponding author

Correspondence to Alessandro Maisto.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Maisto, A., Balzano, W. (2021). Building a Pos Tagger and Lemmatizer for the Italian Language. In: Barolli, L., Woungang, I., Enokido, T. (eds) Advanced Information Networking and Applications. AINA 2021. Lecture Notes in Networks and Systems, vol 227. Springer, Cham. https://doi.org/10.1007/978-3-030-75078-7_7
