A Sequential Algorithm for Training Text Classifiers

Lewis, David D.; Gale, William A.

doi:10.1007/978-1-4471-2099-5_1

Abstract

The ability to cheaply train text classifiers is critical to their use in information retrieval, content analysis, natural language processing, and other tasks involving data which is partly or fully textual. An algorithm for sequential sampling during machine learning of statistical classifiers was developed and tested on a newswire text categorization task. This method, which we call uncertainty sampling, reduced by as much as 500-fold the amount of training data that would have to be manually classified to achieve a given level of effectiveness.
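
The abstract names the method but does not spell out the procedure, so the following is only a minimal sketch of the general idea behind uncertainty sampling: repeatedly train a probabilistic classifier on the labeled examples, then ask a human to label the pool items whose predicted class probability is closest to 0.5. The function names, the `oracle` and `train` callbacks, and the `batch_size` and `rounds` parameters are all illustrative assumptions, not details taken from the paper.

```python
from typing import Any, Callable, Dict, List, Sequence, Set, Tuple

# Hypothetical type aliases for this sketch (not from the paper).
Model = Callable[[Any], float]                       # returns P(class | item)
Trainer = Callable[[List[Tuple[Any, bool]]], Model]  # fits a model on labeled pairs


def most_uncertain(probs: Sequence[float], labeled: Set[int], k: int) -> List[int]:
    """Return indices of up to k unlabeled items whose probability is closest to 0.5."""
    candidates = [i for i in range(len(probs)) if i not in labeled]
    candidates.sort(key=lambda i: abs(probs[i] - 0.5))
    return candidates[:k]


def uncertainty_sampling(
    pool: Sequence[Any],
    oracle: Callable[[Any], bool],  # stands in for the human judge
    train: Trainer,
    seed_indices: Sequence[int],    # assumed to contain both classes
    batch_size: int,
    rounds: int,
) -> Dict[int, bool]:
    """Grow a labeled set by querying the items the current model is least sure about."""
    labels: Dict[int, bool] = {i: oracle(pool[i]) for i in seed_indices}
    for _ in range(rounds):
        # Retrain on everything labeled so far, then score the whole pool.
        model = train([(pool[i], y) for i, y in labels.items()])
        probs = [model(x) for x in pool]
        picked = most_uncertain(probs, set(labels), batch_size)
        if not picked:  # pool exhausted
            break
        for i in picked:
            labels[i] = oracle(pool[i])
    return labels
```

Any classifier that outputs a probability can be plugged in as `train`. The saving in manual labeling comes from spending the judge's effort on borderline items rather than on items the model already classifies confidently.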

Author information

Authors and Affiliations

AT&T Bell Laboratories, Murray Hill, NJ, 07974, USA
David D. Lewis & William A. Gale

Editor information

Editors and Affiliations

Department of Computer Science, University of Massachusetts, 01003, Amherst, MA, USA
W. Bruce Croft
Department of Computer Science, University of Glasgow, G12 8RZ, 8–17 Lilybank Gardens, Glasgow, Scotland
C. J. van Rijsbergen

About this paper

Cite this paper

Lewis, D.D., Gale, W.A. (1994). A Sequential Algorithm for Training Text Classifiers. In: Croft, W.B., van Rijsbergen, C.J. (eds) SIGIR ’94. Springer, London. https://doi.org/10.1007/978-1-4471-2099-5_1

DOI: https://doi.org/10.1007/978-1-4471-2099-5_1
Publisher Name: Springer, London
Print ISBN: 978-3-540-19889-5
Online ISBN: 978-1-4471-2099-5
eBook Packages: Springer Book Archive