Tagging with Small Training Corpora

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2189)

Abstract

The analysis of textual data may start by classifying words using a predefined tag set. However, assigning part-of-speech tags to words in unrestricted text (POS-tagging) is still a problem for natural language text understanding. Most current taggers require huge amounts of hand-tagged text for training (on the order of 10^5 pretagged words): this demands linguistically highly trained manpower for a highly repetitive and boring job, and the results obtained are not of optimal quality. Moreover, when one wants to change to another text genre, the same kind of problem must be faced again. Our proposal goes in another direction. By carefully combining a large lexicon with an efficient neural-network-based generator of taggers, we can generate POS-taggers using no more than 10^4 hand-corrected tagged words for training. A training text of this size can feasibly be hand corrected. Experimental results are presented and discussed for the SUSANNE Corpus. Results on three additional Portuguese corpora are also discussed. Precision rates of 96% are obtained when unknown words occur in the test set, and 98% when every word in the test set is known.
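The two precision figures quoted above depend on whether the test set contains words that are unknown to the tagger. As a minimal sketch of how such a split might be measured, the following Python fragment computes tagging precision over all tokens and separately over tokens found in a lexicon. The `Token` type, the function names, the case-insensitive lexicon lookup, and the toy data are all assumptions made for illustration; none of them comes from the paper, and the paper's own evaluation procedure may differ.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Token(NamedTuple):
    """One evaluated token (hypothetical structure, not from the paper)."""
    word: str
    gold: str        # hand-corrected reference tag
    predicted: str   # tag assigned by the tagger under evaluation


def precision(tokens: Iterable[Token]) -> float:
    """Fraction of tokens whose predicted tag equals the reference tag."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    correct = sum(1 for t in tokens if t.predicted == t.gold)
    return correct / len(tokens)


def precision_by_coverage(tokens: Iterable[Token],
                          lexicon: Set[str]) -> Dict[str, float]:
    """Precision over all tokens, known-word tokens, and unknown-word tokens.

    A word counts as known when its lowercased form is in `lexicon`
    (an assumption; a real system might define this differently).
    """
    tokens = list(tokens)
    known = [t for t in tokens if t.word.lower() in lexicon]
    unknown = [t for t in tokens if t.word.lower() not in lexicon]
    return {
        "all": precision(tokens),
        "known": precision(known),
        "unknown": precision(unknown),
    }


# Toy data, invented purely to exercise the functions.
toy_lexicon = {"the", "dog", "barks"}
toy_tokens = [
    Token("The", "AT", "AT"),
    Token("dog", "NN", "NN"),
    Token("barks", "VBZ", "NNS"),   # a tagging error on a known word
    Token("zorp", "NN", "NN"),      # a word missing from the lexicon
]
scores = precision_by_coverage(toy_tokens, toy_lexicon)
```

Reporting the "all" and "known" figures side by side makes it visible how much of the error is attributable to words missing from the lexicon.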

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Marques, N.C., Lopes, G.P. (2001). Tagging with Small Training Corpora. In: Hoffmann, F., Hand, D.J., Adams, N., Fisher, D., Guimaraes, G. (eds) Advances in Intelligent Data Analysis. IDA 2001. Lecture Notes in Computer Science, vol 2189. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44816-0_7

  • DOI: https://doi.org/10.1007/3-540-44816-0_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42581-6

  • Online ISBN: 978-3-540-44816-7

  • eBook Packages: Springer Book Archive
