Preventing Overfitting in Learning Text Patterns for Document Categorization

  • Conference paper
Advances in Pattern Recognition — ICAPR 2001 (ICAPR 2001)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2013)

Abstract

There is increasing interest in categorizing texts using learning algorithms. While the majority of approaches rely on learning linear classifiers, there is also some interest in describing document categories by text patterns. We introduce a model for learning patterns for text categorization (the LPT-model) that does not rely on an attribute-value representation of documents but represents documents essentially “as they are”. Based on the LPT-model, we focus on learning patterns within a relatively simple pattern language. We compare different search heuristics and pruning methods known from various symbolic rule learners on a set of representative text categorization problems. The best results were obtained using the m-estimate as the search heuristic, combined with the likelihood-ratio statistic for pruning. Even better results can be obtained when the likelihood-ratio statistic is replaced by a new pruning measure, which we call the l-measure. In contrast to conventional measures for pruning, the l-measure takes into account properties of the search space.
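
For readers unfamiliar with the two measures named in the abstract, the following is a minimal sketch of the m-estimate and the likelihood-ratio statistic as they are commonly defined in the rule-learning literature. It is not taken from the paper: the function names, the two-class setting, and the default `m=2.0` are assumptions made for illustration, and the paper's own formulation may differ. The l-measure is not defined in the abstract, so it is not sketched here.

```python
import math


def m_estimate(p, n, P, N, m=2.0):
    """m-estimate of a pattern's precision (textbook form, assumed here).

    p, n: positive / negative training documents matched by the pattern.
    P, N: positive / negative training documents overall.
    m:    weight given to the prior class probability (illustrative default).
    """
    prior = P / (P + N)
    return (p + m * prior) / (p + n + m)


def likelihood_ratio_statistic(p, n, P, N):
    """Likelihood-ratio statistic for two classes (textbook form, assumed here).

    Compares the observed class counts among matched documents with the
    counts expected if the pattern matched documents at random.
    """
    covered = p + n
    observed = (p, n)
    expected = (covered * P / (P + N), covered * N / (P + N))
    # Terms with an observed count of zero contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# A pattern matching 8 positive and 2 negative documents out of 50 + 50.
score = m_estimate(8, 2, 50, 50)
significance = likelihood_ratio_statistic(8, 2, 50, 50)
```

In learners of this kind, the m-estimate typically ranks candidate patterns during search, while the likelihood-ratio statistic acts as a stopping or pruning test: a pattern is kept only if the statistic exceeds a chosen threshold. For two classes the statistic is approximately chi-square distributed with one degree of freedom, so a threshold near 3.84 corresponds to the 95% level.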

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Junker, M., Dengel, A. (2001). Preventing Overfitting in Learning Text Patterns for Document Categorization. In: Singh, S., Murshed, N., Kropatsch, W. (eds) Advances in Pattern Recognition — ICAPR 2001. ICAPR 2001. Lecture Notes in Computer Science, vol 2013. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44732-6_14

  • DOI: https://doi.org/10.1007/3-540-44732-6_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-41767-5

  • Online ISBN: 978-3-540-44732-0

  • eBook Packages: Springer Book Archive
