Text categorization via generalized discriminant analysis

https://doi.org/10.1016/j.ipm.2008.03.005

Abstract

Text categorization is an important research area and has been receiving much attention due to the growth of on-line information and of the Internet. Automated text categorization is generally cast as a multi-class classification problem, yet much of the previous work has focused on binary document classification. Support vector machines (SVMs) excel in binary classification, but the elegant theory behind the large-margin hyperplane cannot be easily extended to multi-class text classification. In addition, training time and scaling are also important concerns. On the other hand, other techniques that naturally extend to multi-class classification are generally not as accurate as SVM. This paper presents a simple and efficient solution to multi-class text categorization. Classification problems are first formulated as optimization via discriminant analysis. Text categorization is then cast as the problem of finding coordinate transformations that reflect the inherent similarity in the data. While most previous approaches decompose a multi-class classification problem into multiple independent binary classification tasks, the proposed approach enables direct multi-class classification. By using the generalized singular value decomposition (GSVD), a coordinate transformation that reflects the inherent class structure indicated by the generalized singular values is identified. Extensive experiments demonstrate the efficiency and effectiveness of the proposed approach.

Introduction

With the ever-increasing growth of on-line information and the permeation of the Internet into daily life, methods that assist users in organizing large volumes of documents are in huge demand. In particular, automatic text categorization has been extensively studied recently. This categorization problem is usually viewed as supervised learning, where the goal is to assign predefined category labels to unlabeled documents based on the likelihood inferred from the training set of labeled documents. Numerous approaches have been applied, including Bayesian probabilistic approaches (Lewis, 1998; Tzeras & Hartmann, 1993), nearest neighbor (Lam & Ho, 1998; Masand, Linoff, & Waltz, 1992), neural networks (Wiener, Pedersen, & Weigend, 1995), decision trees (Apte, Damerau, & Weiss, 1998), inductive rule learning (Cohen & Singer, 1996; Dumais et al., 1998), support vector machines (Godbole et al., 2002; Joachims, 2001), maximum entropy (Nigam, Lafferty, & McCallum, 1999), boosting (Schapire & Singer, 2000), and linear discriminant projection (Chakrabarti, Roy, & Soundalgekar, 2002) (see Yang & Liu, 1999, for comparative studies of text categorization methods).

Although document collections are likely to contain many different categories, most of the previous work has focused on binary document classification. One of the most effective binary classification techniques is the support vector machine (SVM) (Vapnik, 1998). It has been demonstrated that the method performs superbly in binary discriminative text classification (Joachims, 2001; Yang & Liu, 1999). SVMs are accurate and robust, and can quickly adapt to test instances. However, the elegant theory behind the use of large-margin hyperplanes cannot be easily extended to multi-class text categorization problems. A number of techniques for reducing multi-class problems to binary problems have been proposed, including the one-versus-the-rest method, pairwise comparison (Hastie & Tibshirani, 1998), and error-correcting output coding (Allwein et al., 2000; Dietterich & Bakiri, 1995). In these approaches, the original problem is decomposed into a collection of binary problems, and the assertions of the binary classifiers are integrated to produce the final output. In practice, which reduction method is best suited is problem-dependent, so selecting a decomposition method is a non-trivial task. Indeed, each reduction method has its own merits and limitations (Allwein et al., 2000). In addition, regardless of specific details, these reduction techniques do not appear to be well suited for text categorization tasks with a large number of categories, because training a single binary SVM requires $O(n^{\alpha})$ time for $1.7 \le \alpha \le 2.1$, where $n$ is the number of training examples (Joachims, 1998). Thus, having to train many classifiers has a significant impact on the overall training time, and the use of multiple classifiers slows down prediction. Despite its elegance and superiority, then, SVM may not be best suited for multi-class document classification. However, there do not appear to be many alternatives, since many other techniques that can be naturally extended to handle multi-class classification problems, such as neural networks and decision trees, are not as accurate as SVMs (Yang & Liu, 1999; Yang & Pederson, 1997).
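To make the reduction concrete, here is a minimal one-versus-the-rest sketch in Python. It assumes scikit-learn's LinearSVC as the binary base learner, an illustrative choice not prescribed by the paper; note how $k$ classes require training $k$ separate binary classifiers, which is exactly the cost described above.

```python
import numpy as np
from sklearn.svm import LinearSVC  # illustrative binary base learner

def train_one_vs_rest(X, y, classes):
    """Train one binary classifier per class: class c vs. the rest."""
    return {c: LinearSVC().fit(X, (y == c).astype(int)) for c in classes}

def predict_one_vs_rest(models, X):
    """Integrate the binary assertions: assign each document to the
    class whose classifier is most confident (largest margin)."""
    labels = list(models.keys())
    scores = np.column_stack([models[c].decision_function(X) for c in labels])
    return np.asarray(labels)[scores.argmax(axis=1)]
```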

In the statistical pattern recognition literature, discriminant analysis approaches are well known for their ability to learn discriminative feature transformations (see, e.g., Fukunaga, 1990). For example, Fisher discriminant analysis (Fisher, 1936) finds a discriminative feature transformation as the eigenvectors associated with the largest eigenvalues of the matrix $T = S_w^{-1} S_b$, where $S_w$ is the intra-class covariance matrix and $S_b$ is the inter-class covariance matrix. Intuitively, $T$ captures not only the compactness of individual classes but also the separation among them. Thus, eigenvectors corresponding to the largest eigenvalues of $T$ are likely to constitute a discriminative feature transform. However, for text categorization, $S_w$ is usually singular owing to the large number of terms. Simply removing the null space of $S_w$ would eliminate important discriminant information whenever the projections of $S_b$ along those directions are nonzero (Fukunaga, 1990). This issue has stymied attempts to use traditional discriminant approaches in document analysis.
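For concreteness, the following is a minimal sketch of classical Fisher discriminant analysis in Python (numpy only; the variable names are ours). It computes $S_w$ and $S_b$ from scratch and fails in exactly the situation described above, since inverting a singular $S_w$ raises an error.

```python
import numpy as np

def fisher_lda(X, y, n_components):
    """Fisher discriminant transform: top eigenvectors of inv(Sw) @ Sb.
    Breaks down when Sw is singular, as it typically is for
    high-dimensional document-term matrices."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)            # intra-class scatter
        diff = (mc - overall_mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)          # inter-class scatter
    evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(evals.real)[::-1]
    return evecs[:, order[:n_components]].real   # columns form the transform
```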

In this paper we resolve this problem. We extend discriminant analysis and present a simple, efficient, yet effective solution to text categorization. We cast text categorization as the problem of finding transformations that reflect the inherent similarity in the data. In this framework, given a document of unknown class membership, we compare the distance of the new document to the centroid of each category in the transformed space and assign it to the class whose centroid is closest. We call this method generalized discriminant analysis (GDA), since it uses the generalized singular value decomposition to optimize the transformation. We show that the transformation derived using GDA is equivalent to optimization via determinant ratios and a new criterion. A preliminary version of this work was presented in a conference paper (Li, Zhu, & Ogihara, 2003).
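A minimal sketch of this decision rule, assuming a transformation matrix G has already been computed and using Euclidean distance in the transformed space (the specific distance measure here is our assumption for illustration):

```python
import numpy as np

def centroid_classify(A_train, y_train, A_test, G):
    """Project all documents by G, then assign each test document to
    the class whose centroid is nearest in the transformed space."""
    Y_train, Y_test = A_train @ G, A_test @ G
    classes = np.unique(y_train)
    centroids = np.vstack([Y_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(Y_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]
```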

GDA has several favorable properties. First, it is simple and can be programmed in a few lines of MATLAB. Second, it is efficient: most of our experiments took only several seconds. Third, the algorithm does not involve parameter tuning. Finally, and probably most importantly, it is very accurate. We have conducted extensive experiments on various datasets to evaluate its performance. The rest of the paper is organized as follows: Section 2 reviews related work on text categorization. Section 3 introduces classical linear discriminant analysis. Section 4 presents the generalized discriminant analysis for handling singular problems. Section 5 shows that the transformation derived using GDA can also be obtained by optimizing determinant ratios and a new criterion. Section 6 presents some illustrative examples. Section 7 shows experimental results. Finally, Section 8 provides conclusions and discussion.

Section snippets

Related work

Text categorization algorithms can be roughly classified into two types: those that can be naturally extended to handle multi-class cases and those that require decomposition into binary classification problems. The first type consists of algorithms such as Naive Bayes (Lam & Ho, 1998; Masand et al., 1992), neural networks (Ng et al., 1997; Wiener et al., 1995), k-nearest neighbors (Lam & Ho, 1998; Masand et al., 1992), maximum entropy (Nigam et al., 1999), and decision trees. Naive Bayes

Classical linear discriminant analysis

The notation used throughout this paper is listed in Table 1.

Given a document-term matrix $A = (a_{ij}) \in \mathbb{R}^{n \times N}$, where each row corresponds to a document and each column corresponds to a particular term, we consider finding a linear transformation $G \in \mathbb{R}^{N \times \ell}$ ($\ell < N$) that maps each row $a_i$ ($1 \le i \le n$) of $A$ in the $N$-dimensional space to a row $y_i$ in the $\ell$-dimensional space. The resulting data matrix $A_L = AG \in \mathbb{R}^{n \times \ell}$ contains $\ell$ columns, i.e., there are $\ell$ features for each document in the reduced
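A shape-level illustration of this mapping (the $G$ below is a random placeholder purely to show the dimensions; the actual $G$ is the object the method optimizes):

```python
import numpy as np

n, N, ell = 5, 1000, 3        # documents, terms, reduced dimension
A = np.random.rand(n, N)      # toy document-term matrix
G = np.random.rand(N, ell)    # placeholder transformation
A_L = A @ G                   # row i is y_i = a_i G, a point in R^ell
assert A_L.shape == (n, ell)  # ell features per document
```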

Generalized discriminant analysis

In general, the within-class scatter matrix $S_w$ may be singular, especially for document-term matrices, where the dimension is very high. Usually, this problem is overcome by using a non-singular intermediate space of $S_w$, obtained by removing the null space of $S_w$, and then computing eigenvectors. However, removing the null space of $S_w$ may eliminate useful information, because some of the most discriminative dimensions can be lost in the removal. In fact, the null space of $S_w$ is
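For contrast, a common workaround outside this paper is to shrink $S_w$ toward the identity so that the generalized eigenproblem $S_b v = \lambda S_w v$ becomes well posed. The sketch below shows that regularized stand-in; it is not the paper's GSVD-based method, which avoids regularization and retains the discriminant information in the null space of $S_w$:

```python
import numpy as np
from scipy.linalg import eigh

def regularized_lda(Sb, Sw, n_components, reg=1e-3):
    """Regularized fallback: solve Sb v = lambda (Sw + shrinkage) v.
    Shown only as a contrast to the GSVD approach in this section."""
    d = Sw.shape[0]
    Sw_reg = Sw + reg * (np.trace(Sw) / d) * np.eye(d)  # make Sw nonsingular
    evals, evecs = eigh(Sb, Sw_reg)                     # ascending eigenvalues
    return evecs[:, np.argsort(evals)[::-1][:n_components]]
```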

Connections

Here we show that the above derivation can also be obtained by optimizing the determinant ratios or using a new optimization criterion.

Text classification via GDA: examples

A well-known transformation method in information retrieval is latent semantic indexing (LSI) (Deerwester et al., 1990), which applies the singular value decomposition (SVD) to the document-term matrix and computes the eigenvectors associated with the largest eigenvalues as the directions related to the dominant combinations of the terms occurring in the dataset (the latent semantics). A transformation matrix constructed from these eigenvectors projects a document onto the latent semantic space. Although LSI has been
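For comparison with GDA, a minimal LSI sketch via truncated SVD (this follows one common convention, projecting documents directly onto the top right singular vectors; folding-in variants additionally scale by the inverse singular values):

```python
import numpy as np

def lsi_transform(A, k):
    """Truncated SVD of the document-term matrix A (documents as rows).
    Returns an N x k projection onto the latent semantic space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:k].T

# Usage: A_k = A @ lsi_transform(A, k) gives k latent features per document.
```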

The datasets

For our experiments we used a variety of datasets, most of which are frequently used in information retrieval research. The number of classes ranges from 4 to 105, and the number of documents ranges from 476 to 20,000, which seems varied enough to provide good insight into how GDA performs. Table 3 summarizes the characteristics of the datasets.

Discussions and conclusions

In this paper, we presented GDA, a simple, efficient, and yet accurate direct approach to multi-class text categorization. GDA utilizes the GSVD to transform the original data into a new space that reflects the inherent similarities between classes, based on a new optimization criterion. Extensive experiments clearly demonstrate its efficiency and effectiveness.

Interestingly enough, although traditional discriminant approaches have been successfully applied in pattern recognition, little

Acknowledgements

This work is supported in part by NSF Grants EIA-0080124, DUE-9980943, and EIA-0205061, and NIH Grant P30-AG18254.

References (39)

  • Allwein, E. L., Schapire, R. E., & Singer, Y. (2000). Reducing multiclass to binary: A unifying approach for margin...
  • Apte, C., Damerau, F., & Weiss, S. (1998). Text mining with decision rules and decision trees. In Proceedings of the...
  • Bai, Z. (1992). The CSD, GSVD, their applications and computations. Technical report IMA preprint series 958,...
  • Chakrabarti, S., Roy, S., & Soundalgekar, M. V. (2002). Fast and accurate text classification via multiple linear...
  • Cohen, W. W., & Singer, Y. (1996). Context-sensitive learning methods for text categorization. In...
  • Collobert, R., et al. (2001). SVMTorch: Support vector machines for large-scale regression problems. Journal of Machine Learning Research.
  • Deerwester, S. C., et al. (1990). Indexing by latent semantic analysis. Journal of the American Society of Information Science.
  • Demmel, J., et al. (1992). Jacobi's method is more accurate than QR. SIAM Journal of Matrix Analysis and Applications.
  • Dietterich, T. G., et al. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research.
  • Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text...
  • Fisher, R. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics.
  • Fragoudis, D., Meretakis, D., & Likothanassis, S. (2002). Integrating feature and instance selection for text...
  • Fukunaga, K. (1990). Introduction to statistical pattern recognition.
  • Ghani, R. (2000). Using error-correcting codes for text classification. In...
  • Godbole, S., Sarawagi, S., & Chakrabarti, S. (2002). Scaling multi-class support vector machine using inter-class...
  • Golub, G. H., et al. (1996). Matrix computations.
  • Han, E.-H., Boley, D., Gini, M., Gross, R., Hastings, K., Karypis, G., et al. (1998). WebACE: A web agent for document...
  • Hastie, T., et al. (1998). Classification by pairwise coupling.
  • Howland, P., et al. (2003). Generalizing discriminant analysis using the generalized singular value decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence.