Elsevier

Information Sciences

Volume 176, Issue 12, 22 June 2006, Pages 1712-1738
Class normalization in centroid-based text categorization

https://doi.org/10.1016/j.ins.2005.05.010

Abstract

Centroid-based categorization is one of the most popular algorithms in text classification. In this approach, normalization is an important factor for improving the performance of a centroid-based classifier when documents in the text collection have quite different sizes and/or the numbers of documents per class are unbalanced. In the past, most researchers applied document normalization, e.g., document-length normalization, while some considered a simple kind of class normalization, so-called class-length normalization, to solve the imbalance problem. However, no intensive work has clarified how these normalizations affect classification performance or whether there are other useful normalizations. The purpose of this paper is threefold: (1) to investigate the effectiveness of document- and class-length normalizations on several data sets, (2) to evaluate a number of commonly used normalization functions and (3) to introduce a new type of class normalization, called term-length normalization, which exploits term distribution among documents in the class. The experimental results show that a classifier with the weight–merge–normalize approach (class-length normalization) performs better than one with the weight–normalize–merge approach (document-length normalization) on data sets with unbalanced numbers of documents per class, and is quite competitive on those with balanced numbers. Among normalization functions, normalization based on term weighting performs best on average. Term-length normalization is also useful for improving classification accuracy. The combination of term- and class-length normalizations outperforms pure class-length normalization, pure term-length normalization and no normalization by 4.29%, 11.50% and 30.09%, respectively.

Introduction

In the past, several text categorization models were developed under different schemes, such as probabilistic models [1], [2], decision trees and rules [3], [4], regression models [5], [6], example-based models (e.g., k-nearest neighbor or k-NN) [6], [7], [8], [9], linear models [10], [11], [12], [13], [14], support vector machines [6], [15], neural networks [16], [17], [18] and so on. Among these models, a variant of the linear models called the centroid-based method is attractive since it requires less computation than other methods in both the learning and classification stages. The traditional centroid-based method [19] can be viewed as a specialization of the so-called Rocchio method [20] and has been used in several works on text categorization [11], [12], [21], [22]. Based on the vector space model, a centroid-based method computes beforehand, for each category, an explicit profile (or class prototype), which is a centroid vector of all positive training documents of that category. The classification task is then to find the class most similar to the vector of the document to be classified, for example by means of cosine similarity. This type of classifier is easy to implement and computationally efficient. Several studies, including those in [10], [13], [18], [19], [21], [22], [23], showed that it achieves relatively high classification accuracy. In a centroid-based approach, each individual class is modeled by weighting the terms appearing in the training documents of that class. This makes the classification performance of the model depend strongly on the weighting method applied. As part of the term-weighting calculation, normalization is an important factor in constructing a better representation of a class. Unlike instance-based methods such as k-NN, centroid-based methods can utilize class normalization as an additional factor to improve the representation of a class.
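The centroid-based scheme described above can be sketched in a few lines. This is our own illustration, not the authors' implementation; the function names are hypothetical, and the document vectors are assumed to be pre-weighted (e.g., by tf × idf):

```python
import numpy as np

def train_centroids(doc_vectors, labels, num_classes):
    """Build one centroid (class prototype) per class by averaging
    the weighted vectors of that class's training documents."""
    dim = doc_vectors.shape[1]
    centroids = np.zeros((num_classes, dim))
    for c in range(num_classes):
        members = doc_vectors[labels == c]
        centroids[c] = members.mean(axis=0)
    return centroids

def classify(doc_vector, centroids):
    """Assign the class whose centroid has the highest cosine
    similarity with the document vector."""
    dots = centroids @ doc_vector
    norms = np.linalg.norm(centroids, axis=1) * np.linalg.norm(doc_vector)
    return int(np.argmax(dots / norms))
```

Both training (one pass to average per class) and classification (one similarity per class) are linear in the data, which is the computational advantage noted above.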
In the past, most researchers applied document-length normalization to prevent a long document from gaining an advantage over a short one, and some considered a simple kind of class normalization, so-called class-length normalization, to solve the imbalance problem. However, no intensive work has clarified how these normalizations affect classification performance or whether there are other useful normalizations. In this paper, a more systematic experiment is made to explore normalization approaches in centroid-based methods. The purpose of this paper is threefold: (1) to investigate the effectiveness of document- and class-length normalizations, (2) to evaluate a number of commonly used normalization functions and (3) to introduce a new type of class normalization, called term-length normalization, which exploits term distribution among documents in the class. Using various data sets, the performance is compared to that of the standard centroid-based classifier with tf × idf as the baseline. In the rest of this paper, Section 2 presents centroid-based text categorization. Section 3 describes the concept of normalization in text categorization as well as the proposed class-length and term-length normalizations. The data sets and experimental settings are described in Section 4. Section 5 presents experimental results on the four data sets. Conclusions and future work are given in Section 6.

Section snippets

Centroid-based text categorization

Given a set of classes C = {c1, c2, …, c|C|} and a set of training documents D = {d1, d2, …, d|D|}, where each training document dj is assigned to one or more classes, text categorization is the task of using this given information to find one or more suitable categories for a new document. By this definition, a class ck is given the set of training documents assigned to that class (Ck = {dj | dj belongs to the class ck}). In a vector space model, a document (or a class) is represented by a vector based on a bag of
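For concreteness, a minimal tf × idf weighting sketch over tokenized documents follows. This is our own illustration using one common idf variant, idf(t) = log(|D| / df(t)); the weighting schemes evaluated in the paper may differ:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute tf x idf weights for a list of tokenized documents.
    tf is the raw term frequency in a document; df(t) is the number
    of documents containing term t."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    vocab = sorted(df)
    idf = {t: math.log(n / df[t]) for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors
```

A term that occurs in every document gets idf = 0 under this variant, so it contributes nothing to any document (or centroid) vector.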

Normalization approaches

From the normalization aspect, two different schemes can be considered. The document-oriented scheme focuses on normalizing a document (class) vector based on document (class) sizes, while the term-oriented scheme focuses on normalizing with respect to term sizes. Normalization based on document sizes, known as document-length normalization, was reported by several works in the Information Retrieval (IR) field to be helpful in improving retrieval accuracy [25], [29],
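Two normalization functions that commonly appear in this setting are L2 (cosine) and L1 (sum) normalization. The sketch below is our own illustration of both, independent of the paper's notation:

```python
import numpy as np

def cosine_normalize(vec):
    """L2 (cosine) normalization: divide by the Euclidean length so
    the normalized vector has unit length; long and short documents
    then contribute on the same scale."""
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def sum_normalize(vec):
    """L1 (sum) normalization: divide by the sum of absolute weights
    so the component magnitudes sum to one."""
    total = np.abs(vec).sum()
    return vec / total if total > 0 else vec
```

Either function can be applied to a document vector (document-length normalization) or to a merged class vector (class-length normalization); the timing of that choice is examined below.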

Data sets and experimental settings

Four data sets are used in the experiments: (1) Drug Information (DI), (2) Newsgroups (News), (3) WebKB1 and (4) WebKB2. The first data set (DI) is a set of drug information web pages collected from www.rxlist.com. It includes 4480 English web pages with seven classes: adverse drug reaction, clinical pharmacology, description, indications, overdose, patient information, and warning. Each web page in this data set consists of informative content with a few links. Its structure is well organized.

Normalization timing

Two types of normalization timing, WNM (weight–normalize–merge) and WMN (weight–merge–normalize), are explored using tf × idf term weighting with cosine normalization. The results are reported as an average classification accuracy (%) ± standard error of the mean (SEM). Table 5 shows that WMN outperforms WNM on the data sets with unbalanced numbers of documents per class (WebKB1 and WebKB2) and is quite competitive on the data sets with balanced numbers of documents per class (DI and News). When a class possesses a large number of
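The two timings can be contrasted directly. In this sketch (our own, assuming plain averaging as the merge step and cosine normalization), WNM normalizes each document vector before merging, while WMN merges first and normalizes the resulting class vector:

```python
import numpy as np

def centroid_wnm(doc_vectors):
    """Weight-Normalize-Merge: cosine-normalize each weighted document
    vector first, then average (document-length normalization)."""
    normed = [v / np.linalg.norm(v) for v in doc_vectors]
    return np.mean(normed, axis=0)

def centroid_wmn(doc_vectors):
    """Weight-Merge-Normalize: average the raw weighted vectors first,
    then cosine-normalize the result (class-length normalization)."""
    merged = np.mean(doc_vectors, axis=0)
    return merged / np.linalg.norm(merged)
```

With WNM every document contributes a unit-length vector regardless of its size, whereas with WMN long documents dominate the merged vector before the single final normalization, which is why the two orderings behave differently when document lengths or class sizes are skewed.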

Conclusion and future work

In this paper, we illustrated how to utilize class normalization to improve the performance of a centroid-based classifier. To do this, we investigated the effectiveness of document- and class-length normalizations and then evaluated the effects of various normalization functions. We also proposed term-length normalization, which exploits term distribution among documents in a class for term weighting. The experimental results indicated that weight–merge–normalize (a form of

Acknowledgements

This work has been supported by the National Electronics and Computer Technology Center (NECTEC) under project number NT-B-22-I5-38-47-04. We would like to thank the reviewers for their valuable comments.

References (41)

  • G. Salton et al., Term-weighting approaches in automatic text retrieval, Information Processing and Management (1988)
  • A.N. Aizawa, An information-theoretic perspective of tf–idf measures, Information Processing and Management (2003)
  • M. Craven et al., Learning to construct knowledge bases from the World Wide Web, Artificial Intelligence (2000)
  • K. Nigam et al., Text classification from labeled and unlabeled documents using EM, Machine Learning (2000)
  • Y. Yang, An evaluation of statistical approaches to text categorization, Information Retrieval (1999)
  • C.D. Apté et al., Automated learning of decision rules for text categorization, ACM Transactions on Information Systems (1994)
  • G. Paliouras et al., Learning rules for large vocabulary word sense disambiguation
  • Y. Yang et al., An example-based mapping method for text categorization and retrieval, ACM Transactions on Information Systems (1994)
  • Y. Yang et al., A re-examination of text categorization methods
  • L.S. Larkey, Automatic essay grading using text categorization techniques
  • D.B. Skalak, Prototype and feature selection by sampling and random mutation hill climbing algorithms, in: ...
  • E.-H. Han et al., Text categorization using weight-adjusted k-nearest neighbor classification
  • D.A. Hull, Improving text retrieval for the routing problem using latent semantic indexing
  • D.J. Ittner, D.D. Lewis, D.D. Ahn, Text categorization of low quality images, in: Proceedings of SDAIR-95, 4th Annual ...
  • T. Joachims, A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization
  • W.T. Chuang et al., A fast algorithm for hierarchical text classification
  • Z.-H. Deng et al., A linear text classification algorithm based on category relevance factors
  • T. Joachims, Text categorization with support vector machines: learning with many relevant features
  • H.T. Ng et al., Feature selection, perceptron learning, and a usability case study for text categorization
  • P. Koehn, Combining multiclass maximum entropy text classifiers with neural network voting, in: E. Ranchod, N.J. Mamede ...