Preventing Overfitting in Learning Text Patterns for Document Categorization

Junker, Markus; Dengel, Andreas

doi:10.1007/3-540-44732-6_14

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2013)

Included in the following conference series:

International Conference on Advances in Pattern Recognition

Abstract

There is increasing interest in categorizing texts using learning algorithms. While the majority of approaches rely on learning linear classifiers, there is also some interest in describing document categories by text patterns. We introduce a model for learning patterns for text categorization (the LPT-model) that does not rely on an attribute-value representation of documents but represents documents essentially “as they are”. Based on the LPT-model, we focus on learning patterns within a relatively simple pattern language. We compare different search heuristics and pruning methods known from various symbolic rule learners on a set of representative text categorization problems. The best results were obtained using the m-estimate as the search heuristic combined with the likelihood-ratio statistic for pruning. Even better results can be obtained by replacing the likelihood-ratio statistic with a new pruning measure, which we call the l-measure. In contrast to conventional measures for pruning, the l-measure takes into account properties of the search space.
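
The two conventional measures named in the abstract can be illustrated with a rough Python sketch. It shows the textbook forms of the m-estimate and the two-class likelihood-ratio statistic, as found in CN2-style rule learners. The function names, the parameter names (p, n, P, N), and the default m = 2 are assumptions made for this example. The abstract does not state the exact formulation used in the paper, and the l-measure is not sketched because it is not defined here.

```python
import math


def m_estimate(p, n, P, N, m=2.0):
    """Textbook m-estimate of a pattern's precision.

    p, n: positive / negative training documents matched by the pattern.
    P, N: positive / negative training documents overall (P + N > 0).
    m:    weight given to the class prior (hypothetical default).
    """
    prior = P / (P + N)
    return (p + m * prior) / (p + n + m)


def likelihood_ratio_statistic(p, n, P, N):
    """Two-class likelihood-ratio statistic, 2 * sum(f_i * ln(f_i / e_i)).

    f_i are the observed class counts among matched documents and e_i
    the counts expected if the pattern matched documents at random.
    Larger values mean the matched class distribution differs more
    from the prior, so the pattern is less likely a chance finding.
    """
    covered = p + n
    expected_pos = covered * P / (P + N)
    expected_neg = covered * N / (P + N)
    total = 0.0
    for observed, expected in ((p, expected_pos), (n, expected_neg)):
        if observed > 0:  # a zero count contributes nothing to the sum
            total += observed * math.log(observed / expected)
    return 2.0 * total
```

In a rule learner of this kind, the first function would typically rank candidate refinements during search, while the second would decide whether a refinement is significant enough to keep. A pattern is usually pruned when its statistic falls below a chosen threshold.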

Author information

Authors and Affiliations

German Research Center for Artificial Intelligence (DFKI) GmbH, P.O. 2080, D-67608, Kaiserslautern, Germany
Markus Junker & Andreas Dengel

Editor information

Editors and Affiliations

Department of Computer Science, University of Exeter, EX4 4PT, Exeter, UK
Sameer Singh
Computational Intelligence Group, Tuiuti University of Parana, Curitiba, Brazil
Nabeel Murshed
Institute of Computer Aided Automation PRIP-Group 1832, Vienna University of Technology, Favoritenstr. 9/2/4, 1040, Wien, Austria
Walter Kropatsch

About this paper

Cite this paper

Junker, M., Dengel, A. (2001). Preventing Overfitting in Learning Text Patterns for Document Categorization. In: Singh, S., Murshed, N., Kropatsch, W. (eds) Advances in Pattern Recognition — ICAPR 2001. ICAPR 2001. Lecture Notes in Computer Science, vol 2013. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44732-6_14

DOI: https://doi.org/10.1007/3-540-44732-6_14
Published: 09 May 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41767-5
Online ISBN: 978-3-540-44732-0
eBook Packages: Springer Book Archive
