Abstract
Despite considerable progress in Natural Language Processing (NLP), many languages, including Italian, still lack basic resources, especially resources that are free for both research and commercial use. One such resource is a part-of-speech tagger, a first processing step in many NLP applications. We describe a weakly supervised, fast, free and reasonably accurate part-of-speech tagger for Italian, created by mining words and their part-of-speech tags from Wiktionary. We have integrated the tagger into Pattern, a freely available Python toolkit. We believe that our approach is general enough to be applied to other languages as well.
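The abstract does not spell out the tagging procedure. As a rough illustration of how word/tag pairs mined from a dictionary can drive a tagger, here is a minimal sketch of lexicon lookup with a default fallback. The entries, the Penn Treebank style tags, and the names `MINED_ENTRIES`, `build_lexicon` and `tag` are all hypothetical; none of them comes from the paper or from Pattern.

```python
from collections import Counter, defaultdict

# Hypothetical (word, tag) pairs standing in for entries mined from Wiktionary.
# A repeated pair simply means the word was seen with that tag more than once.
MINED_ENTRIES = [
    ("il", "DT"),
    ("gatto", "NN"),
    ("nero", "JJ"),
    ("nero", "NN"),
    ("nero", "JJ"),
    ("dorme", "VB"),
]


def build_lexicon(entries):
    """Map each lowercased word to the tag it is most often listed with."""
    counts = defaultdict(Counter)
    for word, pos in entries:
        counts[word.lower()][pos] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(tokens, lexicon, default="NN"):
    """Tag each token by lexicon lookup; unknown words get the default tag."""
    return [(token, lexicon.get(token.lower(), default)) for token in tokens]


LEXICON = build_lexicon(MINED_ENTRIES)
TAGGED = tag("Il gatto nero dorme sul divano".split(), LEXICON)
```

Tagging every unknown word as a noun is a deliberately crude assumption made for brevity; a practical tagger would add contextual or morphological rules on top of the lexicon.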
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
De Smedt, T., Marfia, F., Matteucci, M., Daelemans, W. (2014). Using Wiktionary to Build an Italian Part-of-Speech Tagger. In: Métais, E., Roche, M., Teisseire, M. (eds) Natural Language Processing and Information Systems. NLDB 2014. Lecture Notes in Computer Science, vol 8455. Springer, Cham. https://doi.org/10.1007/978-3-319-07983-7_1
DOI: https://doi.org/10.1007/978-3-319-07983-7_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07982-0
Online ISBN: 978-3-319-07983-7
eBook Packages: Computer Science (R0)