Discriminative features for text document classification

  • ORIGINAL ARTICLE
Formal Pattern Analysis & Applications

An Erratum to this article was published on 16 June 2004

Abstract

The bag-of-words approach to text document representation typically yields document vectors with on the order of 5000–20,000 components. To make effective use of various statistical classifiers, it may be necessary to reduce the dimensionality of this representation. We point out deficiencies in the class discrimination of two popular dimensionality reduction methods: Latent Semantic Indexing (LSI) and sequential feature selection according to some relevant criterion. As a remedy, we suggest feature transforms based on Linear Discriminant Analysis (LDA). Since LDA requires operating on matrices that are both large and dense, we propose an efficient intermediate dimension reduction step using either a random transform or LSI. We report good classification results with the combined feature transform on a subset of the Reuters-21578 database. A drastic reduction of the feature vector dimensionality from 5000 to 12 actually improves classification performance.
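
The two-stage transform described in the abstract (an inexpensive intermediate reduction followed by LDA) can be illustrated with a small sketch. This is not the article's implementation. It assumes a Gaussian random projection as the intermediate step, a regularized within-class scatter matrix, a nearest-centroid classifier, and synthetic Poisson word counts in place of Reuters-21578. All function names, sizes, and the `reg` parameter are hypothetical choices made for illustration.

```python
import numpy as np

def make_toy_counts(rng, n_per_class=60, n_classes=3, n_terms=500, block=50):
    """Synthetic bag-of-words counts (an assumption, not Reuters data):
    each class uses its own block of terms more frequently."""
    rates = np.full((n_classes, n_terms), 0.2)
    for c in range(n_classes):
        rates[c, c * block:(c + 1) * block] = 2.0
    y = np.repeat(np.arange(n_classes), n_per_class)
    X = rng.poisson(rates[y]).astype(float)
    return X, y

def lda_fit(X, y, n_components, reg=1e-3):
    """Fit LDA directions by whitening the within-class scatter and then
    diagonalizing the between-class scatter. Returns (mean, W)."""
    d = X.shape[1]
    mean = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Small ridge term keeps Sw positive definite (illustrative choice).
    Sw += reg * np.trace(Sw) / d * np.eye(d)
    evals, evecs = np.linalg.eigh(Sw)
    whiten = evecs / np.sqrt(evals)          # scales column j by evals[j]**-0.5
    b_evals, b_evecs = np.linalg.eigh(whiten.T @ Sb @ whiten)
    top = np.argsort(b_evals)[::-1][:n_components]
    return mean, whiten @ b_evecs[:, top]

def nearest_centroid_predict(Z_train, y_train, Z_test):
    """Assign each test row to the class with the closest training centroid."""
    classes = np.unique(y_train)
    centroids = np.stack([Z_train[y_train == c].mean(axis=0) for c in classes])
    d2 = ((Z_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d2.argmin(axis=1)]

rng = np.random.default_rng(0)
X_train, y_train = make_toy_counts(rng)
X_test, y_test = make_toy_counts(rng)

# Stage 1: random transform from 500 terms down to 50 dimensions.
k = 50
R = rng.standard_normal((X_train.shape[1], k)) / np.sqrt(k)
P_train, P_test = X_train @ R, X_test @ R

# Stage 2: LDA in the reduced space; at most (n_classes - 1) useful directions.
mean, W = lda_fit(P_train, y_train, n_components=2)
Z_train = (P_train - mean) @ W
Z_test = (P_test - mean) @ W

accuracy = (nearest_centroid_predict(Z_train, y_train, Z_test) == y_test).mean()
print(f"final dimensionality: {Z_test.shape[1]}, held-out accuracy: {accuracy:.2f}")
```

The point of the intermediate step is that the scatter matrices are 50 × 50 here, not 500 × 500, so the eigendecompositions stay cheap. The same random matrix `R` must be applied to both training and test documents.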

Author information

Corresponding author

Correspondence to K. Torkkola.

Additional information

An erratum to this article can be found at http://dx.doi.org/10.1007/s10044-004-0216-3

About this article

Cite this article

Torkkola, K. Discriminative features for text document classification. Formal Pattern Analysis & Applications 6, 301–308 (2004). https://doi.org/10.1007/s10044-003-0196-8

  • DOI: https://doi.org/10.1007/s10044-003-0196-8
