Text Categorization

Shen, Dou

doi:10.1007/978-1-4614-8265-9_414

Reference work entry
First Online: 01 January 2018

Synonyms

Text classification

Definition

Text classification is the task of automatically assigning textual documents (such as plain-text documents and Web pages) to predefined categories based on their content. Formally, text classification operates on an instance space X, in which each instance is a document d, and a fixed set of classes C = {C₁, C₂, … , C_|C|}, where |C| is the number of classes. Given a training set D_l of labeled documents 〈d, C_i〉 ∈ X × C, the goal is to use a learning method (learning algorithm) to learn a classifier, or classification function, γ that maps instances to classes: γ : X → C [7].
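To make the formal definition concrete, the following minimal sketch learns a classification function γ from a small training set of 〈d, C_i〉 pairs. It is an illustration only: the entry does not prescribe a learning method, so a multinomial naive Bayes model with add-one smoothing is assumed here, and the whitespace tokenizer, the function names (`tokenize`, `train`, `gamma`), and the four toy documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def train(labeled_docs):
    """Learn a classifier gamma: X -> C from (document, class) pairs.

    Assumed learning method: multinomial naive Bayes with add-one smoothing.
    """
    class_doc_counts = Counter()        # number of training documents per class
    word_counts = defaultdict(Counter)  # token counts per class
    vocabulary = set()

    for text, label in labeled_docs:
        class_doc_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)

    total_docs = sum(class_doc_counts.values())
    vocab_size = len(vocabulary)

    def gamma(text):
        """Map a document to the class with the highest log-posterior score."""
        best_label, best_score = None, float("-inf")
        for label in sorted(class_doc_counts):
            # Log prior: fraction of training documents in this class.
            score = math.log(class_doc_counts[label] / total_docs)
            class_total = sum(word_counts[label].values())
            for token in tokenize(text):
                if token not in vocabulary:
                    continue  # ignore tokens never seen during training
                # Smoothed log likelihood of the token given the class.
                score += math.log(
                    (word_counts[label][token] + 1) / (class_total + vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return gamma


# Hypothetical training set D_l with two classes.
training_set = [
    ("the team won the match", "sports"),
    ("a late goal decided the game", "sports"),
    ("the election results were announced", "politics"),
    ("parliament passed the new budget", "politics"),
]

gamma = train(training_set)
print(gamma("a late goal won the match"))
print(gamma("parliament announced the budget"))
```

Here `train` plays the role of the learning method and the returned `gamma` is the learned function from documents to classes. Any other supervised learner could be substituted without changing that interface.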

Historical Background

Text classification, which assigns documents to predefined categories, provides an effective way to organize documents. It dates back to the early 1960s, but only in the early 1990s did it become a major subfield of the information systems discipline. Recently, with the explosive growth of...

Recommended Reading

1. Dumais S, Platt J, Heckerman D, Sahami M. Inductive learning algorithms and representations for text categorization. In: Proceedings of the 7th International Conference on Information and Knowledge Management; 1998. p. 148–55.
2. Glover EJ, Tsioutsiouliklis K, Lawrence S, Pennock DM, Flake GW. Using web structure for classifying and describing web pages. In: Proceedings of the 11th International World Wide Web Conference; 2002. p. 562–9.
3. Joachims T. Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European Conference on Machine Learning; 1998. p. 137–42.
4. Kehagias A, Petridis V, Kaburlasos VG, Fragkou P. A comparison of word- and sense-based text categorization using several classification algorithms. J Intell Inf Syst. 2003;21(3):227–47.
5. Kolcz A, Prabakarmurthi V, Kalita JK. String match and text extraction: summarization as feature selection for text categorization. In: Proceedings of the 10th International Conference on Information and Knowledge Management; 2001. p. 365–70.
6. Lewis DD. Representation quality in text classification: an introduction and experiment. In: Proceedings of the Workshop on Speech and Natural Language; 1990. p. 288–95.
7. Manning CD, Raghavan P, Schütze H. Introduction to information retrieval. Cambridge University Press; 2007.
8. McCallum A, Nigam K. A comparison of event models for naive Bayes text classification. In: Proceedings of the AAAI-98 Workshop on Learning for Text Categorization; 1998.
9. Peng F, Schuurmans D, Wang S. Augmenting naive Bayes classifiers with statistical language models. Inf Retr. 2004;7(3–4):317–45.
10. Rijsbergen CV. Information retrieval. 2nd ed. London: Butterworths; 1979.
11. Sebastiani F. Machine learning in automated text categorization. ACM Comput Surv. 2002;34(1):1–47.
12. Shen D, Chen Z, Yang Q, Zeng HJ, Zhang B, Lu Y, Ma WY. Web-page classification through summarization. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 2004. p. 242–9.
13. Shen D, Sun JT, Yang Q, Chen Z. A comparison of implicit and explicit links for web page classification. In: Proceedings of the 15th International World Wide Web Conference; 2006. p. 643–50.
14. Yang Y. An evaluation of statistical approaches to text categorization. Inf Retr. 1999;1(1–2):69–90.
15. Yang Y, Pedersen JO. A comparative study on feature selection in text categorization. In: Proceedings of the 14th International Conference on Machine Learning; 1997. p. 412–20.

Author information

Authors and Affiliations

Microsoft Corporation, Redmond, WA, USA
Dou Shen
Baidu, Inc., Beijing City, China
Dou Shen

Editor information

Editors and Affiliations

Georgia Institute of Technology College of Computing, Atlanta, GA, USA
Ling Liu
University of Waterloo School of Computer Science, Waterloo, ON, Canada
M. Tamer Özsu

Section Editor information

Microsoft Research Asia, Microsoft Corporation, Beijing, Haidian, China
Zheng Chen

About this entry

Cite this entry

Shen, D. (2018). Text Categorization. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_414

DOI: https://doi.org/10.1007/978-1-4614-8265-9_414
Published: 07 December 2018
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
