A Sequential Algorithm for Training Text Classifiers

  • Conference paper
SIGIR ’94

Abstract

The ability to cheaply train text classifiers is critical to their use in information retrieval, content analysis, natural language processing, and other tasks involving data which is partly or fully textual. An algorithm for sequential sampling during machine learning of statistical classifiers was developed and tested on a newswire text categorization task. This method, which we call uncertainty sampling, reduced by as much as 500-fold the amount of training data that would have to be manually classified to achieve a given level of effectiveness.
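The abstract names uncertainty sampling but does not spell out the selection rule. A minimal sketch follows, assuming the usual reading of the term: the current classifier estimates a class-membership probability for each unlabeled document, and the documents whose estimates lie closest to 0.5 are sent for manual labeling. The function name, the batch size, and the sample estimates are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def select_most_uncertain(probs: Sequence[float], batch_size: int) -> List[int]:
    """Return indices of the batch_size items whose estimate is nearest 0.5.

    Assumption: probs[i] is the current classifier's estimated probability
    that unlabeled document i belongs to the class of interest.
    """
    # Rank documents by distance from 0.5, the point of maximum uncertainty.
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:batch_size]


# Hypothetical estimates for five unlabeled documents.
estimates = [0.02, 0.48, 0.91, 0.55, 0.30]

# Indices of the two documents a human would be asked to label next.
to_label = select_most_uncertain(estimates, 2)
```

In a sequential setting this selection would be repeated: label the chosen documents, retrain the classifier on all labeled data, re-score the remaining pool, and select again.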

Copyright information

© 1994 Springer-Verlag London Limited

About this paper

Cite this paper

Lewis, D.D., Gale, W.A. (1994). A Sequential Algorithm for Training Text Classifiers. In: Croft, B.W., van Rijsbergen, C.J. (eds) SIGIR ’94. Springer, London. https://doi.org/10.1007/978-1-4471-2099-5_1

  • DOI: https://doi.org/10.1007/978-1-4471-2099-5_1

  • Publisher Name: Springer, London

  • Print ISBN: 978-3-540-19889-5

  • Online ISBN: 978-1-4471-2099-5

  • eBook Packages: Springer Book Archive