Part-of-Speech Tagging from “Small” Data Sets

Neufeld, Eric; Adams, Greg

doi:10.1007/978-1-4612-2404-4_42

Chapter

Part of the book series: Lecture Notes in Statistics (LNS, volume 112)

Abstract

Probabilistic approaches to part-of-speech (POS) tagging compile statistics from massive corpora such as the Lancaster-Oslo-Bergen (LOB) corpus. Trained on a 900,000-token corpus, the hidden Markov model (HMM) method easily achieves a 95 per cent success rate on a 100,000-token test corpus. However, even such large corpora contain relatively few distinct words, and new words are subsequently encountered in test corpora. For example, the million-token LOB contains only about 45,000 different words, most of which occur only once or twice. We find that 3–4 per cent of tokens in a disjoint test corpus are unseen, that is, unknown to the tagger after training, and that these tokens cause a significant proportion of errors. A corpus representative of all possible tag sequences seems implausible enough, let alone a corpus that also represents, even in small numbers, enough of English to make the problem of unseen words insignificant. Experimental results confirm that this extreme course is not necessary. Variations on the HMM approach, including ending-based approaches, incremental learning strategies, and the use of approximate distributions, result in a tagger that tags unseen words nearly as accurately as seen words.
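
To make the ending-based idea concrete, the following is a minimal sketch of a bigram HMM tagger that falls back on word endings when a token was not seen in training. It is an illustration under stated assumptions, not the chapter's actual method: the function names (train, log_prob, tag), the three-letter ending length, the add-one smoothing, and the toy corpus are all hypothetical choices made for this example.

```python
from collections import Counter, defaultdict
import math


def train(tagged_sentences, suffix_len=3):
    """Collect tag-bigram, word-emission, and ending-emission counts.

    suffix_len=3 is an arbitrary illustrative choice, not a value from the chapter.
    """
    trans = defaultdict(Counter)   # trans[prev_tag][tag]
    emit = defaultdict(Counter)    # emit[tag][word]
    ending = defaultdict(Counter)  # ending[tag][last suffix_len letters of word]
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"               # sentence-start pseudo-tag
        for word, tag_ in sentence:
            w = word.lower()
            trans[prev][tag_] += 1
            emit[tag_][w] += 1
            ending[tag_][w[-suffix_len:]] += 1
            vocab.add(w)
            prev = tag_
    return {"trans": trans, "emit": emit, "ending": ending,
            "vocab": vocab, "tags": sorted(emit), "suffix_len": suffix_len}


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (a simple stand-in estimator)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def tag(model, words):
    """Viterbi decoding; unseen words are scored by their ending instead."""
    if not words:
        return []
    tags = model["tags"]
    n_tags = len(tags)
    n_words = len(model["vocab"]) + 1  # one extra slot for unseen events

    def emission(t, word):
        w = word.lower()
        if w in model["vocab"]:
            return log_prob(model["emit"][t], w, n_words)
        # Unseen word: use the distribution of endings observed with tag t.
        return log_prob(model["ending"][t], w[-model["suffix_len"]:], n_words)

    # best[t] = (log score of the best path ending in tag t, that path)
    best = {t: (log_prob(model["trans"]["<s>"], t, n_tags) + emission(t, words[0]), [t])
            for t in tags}
    for word in words[1:]:
        new_best = {}
        for t in tags:
            score, path = max(
                ((best[p][0] + log_prob(model["trans"][p], t, n_tags), best[p][1])
                 for p in tags),
                key=lambda pair: pair[0])
            new_best[t] = (score + emission(t, word), path + [t])
        best = new_best
    return max(best.values(), key=lambda pair: pair[0])[1]


# Toy training data (invented for this example, not drawn from the LOB corpus).
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barked", "VBD")],
    [("the", "DT"), ("cat", "NN"), ("walked", "VBD")],
    [("a", "DT"), ("bird", "NN"), ("jumped", "VBD")],
]
model = train(corpus)
print(tag(model, ["the", "dog", "talked"]))  # "talked" is unseen; prints ['DT', 'NN', 'VBD']
```

In this toy run the unseen token "talked" is tagged as a past-tense verb because its ending "ked" was observed with that tag in training and the preceding noun makes a verb likely. A realistic tagger would need a better estimator than add-one smoothing and a principled way to choose the ending length.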

Author information

Authors and Affiliations

Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada, S7N 0W0
Eric Neufeld & Greg Adams

Editor information

Editors and Affiliations

Department of Computer Science, Vanderbilt University, Box 1679, Station B, Nashville, Tennessee, 37235, USA
Doug Fisher
Department of Economics, Institute of Statistics and Econometrics, Free University of Berlin, Garystr. 21, 14185 Berlin, Germany
Hans-J. Lenz

About this chapter

Cite this chapter

Neufeld, E., Adams, G. (1996). Part-of-Speech Tagging from “Small” Data Sets. In: Fisher, D., Lenz, HJ. (eds) Learning from Data. Lecture Notes in Statistics, vol 112. Springer, New York, NY. https://doi.org/10.1007/978-1-4612-2404-4_42

DOI: https://doi.org/10.1007/978-1-4612-2404-4_42
Publisher Name: Springer, New York, NY
Print ISBN: 978-0-387-94736-5
Online ISBN: 978-1-4612-2404-4
eBook Packages: Springer Book Archive
