Text Categorization

Synonyms

Text classification

Definition

Text classification is the task of automatically assigning textual documents (such as plain-text documents and Web pages) to predefined categories based on their content. Formally, text classification operates on an instance space X, in which each instance is a document d, and a fixed set of classes C = {C1, C2, … , C|C|}, where |C| is the number of classes. Given a training set Dl of labeled documents 〈d, Ci〉, where 〈d, Ci〉 ∈ X × C, the goal of text classification is to use a learning method (or learning algorithm) to learn a classifier, or classification function, γ that maps instances to classes: γ : X → C [7].
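As an illustration of the definition, the sketch below learns a classification function γ from a small training set of 〈d, Ci〉 pairs. It assumes a multinomial naive Bayes learner with add-one smoothing and whitespace tokenization; this is only one of many possible learning methods, chosen here for brevity and not prescribed by the entry. The function names (`train`, `make_classifier`, `gamma`) and the toy documents and class labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(training_set):
    """Collect per-class statistics from (document, class) pairs."""
    class_doc_counts = Counter()        # number of training documents per class
    term_counts = defaultdict(Counter)  # term frequencies per class
    vocabulary = set()
    for document, label in training_set:
        tokens = document.lower().split()  # assumption: whitespace tokenization
        class_doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, term_counts, vocabulary


def make_classifier(training_set):
    """Return gamma, a function that maps a document in X to a class in C."""
    class_doc_counts, term_counts, vocabulary = train(training_set)
    total_docs = sum(class_doc_counts.values())

    def gamma(document):
        best_label, best_score = None, -math.inf
        for label, n_docs in class_doc_counts.items():
            # Log prior of the class plus smoothed log likelihood of each term.
            score = math.log(n_docs / total_docs)
            total_terms = sum(term_counts[label].values())
            for token in document.lower().split():
                if token not in vocabulary:
                    continue  # terms never seen in training are ignored
                smoothed = (term_counts[label][token] + 1) / (total_terms + len(vocabulary))
                score += math.log(smoothed)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return gamma


# Hypothetical toy training set Dl with two classes, C = {"spam", "ham"}.
training_set = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes attached", "ham"),
]
gamma = make_classifier(training_set)
```

Calling `gamma("buy cheap pills")` on this toy data selects the class whose training documents share the most vocabulary with the input, which is the mapping γ : X → C in miniature.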

Historical Background

By sorting documents into predefined categories, text classification provides an effective way to organize them. The field dates back to the early 1960s, but only in the early 1990s did it become a major subfield of the information systems discipline. Recently, with the explosive growth of...

Recommended Reading

  1. Dumais S, Platt J, Heckerman D, Sahami M. Inductive learning algorithms and representations for text categorization. In: Proceedings of the 7th International Conference on Information and Knowledge Management; 1998. p. 148–55.

  2. Glover EJ, Tsioutsiouliklis K, Lawrence S, Pennock DM, Flake GW. Using web structure for classifying and describing web pages. In: Proceedings of the 11th International World Wide Web Conference; 2002. p. 562–9.

  3. Joachims T. Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European Conference on Machine Learning; 1998. p. 137–42.

  4. Kehagias A, Petridis V, Kaburlasos VG, Fragkou P. A comparison of word- and sense-based text categorization using several classification algorithms. J Intell Inf Syst. 2003;21(3):227–47.

  5. Kolcz A, Prabakarmurthi V, Kalita JK. String match and text extraction: summarization as feature selection for text categorization. In: Proceedings of the 10th International Conference on Information and Knowledge Management; 2001. p. 365–70.

  6. Lewis DD. Representation quality in text classification: an introduction and experiment. In: Proceedings of the Workshop on Speech and Natural Language; 1990. p. 288–95.

  7. Manning CD, Raghavan P, Schütze H. Introduction to information retrieval. Cambridge: Cambridge University Press; 2007.

  8. McCallum A, Nigam K. A comparison of event models for naive Bayes text classification. In: Proceedings of the AAAI-98 Workshop on Learning for Text Categorization; 1998.

  9. Peng F, Schuurmans D, Wang S. Augmenting naive Bayes classifiers with statistical language models. Inf Retr. 2004;7(3–4):317–45.

  10. van Rijsbergen CJ. Information retrieval. 2nd ed. London: Butterworths; 1979.

  11. Sebastiani F. Machine learning in automated text categorization. ACM Comput Surv. 2002;34(1):1–47.

  12. Shen D, Chen Z, Yang Q, Zeng HJ, Zhang B, Lu Y, Ma WY. Web-page classification through summarization. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 2004. p. 242–9.

  13. Shen D, Sun JT, Yang Q, Chen Z. A comparison of implicit and explicit links for web page classification. In: Proceedings of the 15th International World Wide Web Conference; 2006. p. 643–50.

  14. Yang Y. An evaluation of statistical approaches to text categorization. Inf Retr. 1999;1(1–2):69–90.

  15. Yang Y, Pedersen JO. A comparative study on feature selection in text categorization. In: Proceedings of the 14th International Conference on Machine Learning; 1997. p. 412–20.

Corresponding author

Correspondence to Dou Shen.

Copyright information

© 2018 Springer Science+Business Media, LLC, part of Springer Nature

Cite this entry

Shen, D. (2018). Text Categorization. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_414
