Abstract
The analysis of textual data may start by classifying words using a predefined tag set. However, the assignment of part-of-speech tags to words in unrestricted text (POS-tagging) is still a problem for natural language text understanding. Most current taggers require huge amounts of hand-tagged text for training (on the order of 10^5 pretagged words). This requires linguistically highly trained manpower for a highly repetitive and boring job, and the results obtained are not of optimal quality. Moreover, when one wants to move to another text genre, the same problem must be faced again. Our proposal goes in another direction. By carefully combining a large lexicon with an efficient neural-network-based generator of taggers, we can generate POS-taggers using no more than 10^4 hand-corrected tagged words for training. A training text of this size can feasibly be corrected by hand. Experimental results are presented and discussed for the SUSANNE Corpus. Results on three additional Portuguese corpora are also discussed. Precision rates of 96% are obtained when unknown words occur in the test set, and 98% when every word in the test set is known.
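The two precision figures quoted in the abstract depend on whether test words are covered by the lexicon. The following is a minimal sketch of how such a split could be computed, assuming that "precision" means the fraction of tokens assigned the correct tag and that a word counts as "known" when its lowercased form appears in the lexicon. The function name, the tag labels, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str]  # (word, gold tag)


def precision_by_lexicon_coverage(
    gold: List[Token],
    predicted_tags: List[str],
    lexicon: Set[str],
) -> Dict[str, float]:
    """Tagging precision overall, for known words, and for unknown words.

    Assumption: a word is 'known' if its lowercased form is in the lexicon.
    """
    if len(gold) != len(predicted_tags):
        raise ValueError("gold and predicted_tags must have the same length")

    # Each entry holds [correct, total] counts for one group of tokens.
    counts = {"all": [0, 0], "known": [0, 0], "unknown": [0, 0]}
    for (word, gold_tag), pred_tag in zip(gold, predicted_tags):
        group = "known" if word.lower() in lexicon else "unknown"
        for key in ("all", group):
            counts[key][1] += 1
            if pred_tag == gold_tag:
                counts[key][0] += 1

    # Groups with no tokens are reported as NaN rather than 0.0.
    return {
        key: (correct / total if total else float("nan"))
        for key, (correct, total) in counts.items()
    }


# Toy data, invented for illustration only.
gold = [("the", "DET"), ("cat", "N"), ("zorps", "V"), ("loudly", "ADV")]
predicted = ["DET", "N", "N", "ADV"]
lexicon = {"the", "cat", "loudly"}

scores = precision_by_lexicon_coverage(gold, predicted, lexicon)
```

Reporting the known and unknown groups separately makes it visible how much of the overall error comes from words missing from the lexicon, which is the distinction the abstract draws between its two figures.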
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Marques, N.C., Lopes, G.P. (2001). Tagging with Small Training Corpora. In: Hoffmann, F., Hand, D.J., Adams, N., Fisher, D., Guimaraes, G. (eds) Advances in Intelligent Data Analysis. IDA 2001. Lecture Notes in Computer Science, vol 2189. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44816-0_7
DOI: https://doi.org/10.1007/3-540-44816-0_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42581-6
Online ISBN: 978-3-540-44816-7
eBook Packages: Springer Book Archive