Class normalization in centroid-based text categorization
Introduction
In the past, several text categorization models have been developed within different schemes, such as probabilistic models [1], [2], decision trees and rules [3], [4], regression models [5], [6], example-based models (e.g., k-nearest neighbor or k-NN) [6], [7], [8], [9], linear models [10], [11], [12], [13], [14], support vector machines [6], [15], neural networks [16], [17], [18], and so on. Among these models, a variant of linear models called the centroid-based method is attractive since it requires less computation than other methods in both the learning and classification stages. The traditional centroid-based method [19] can be viewed as a specialization of the so-called Rocchio method [20] and has been used in several works on text categorization [11], [12], [21], [22]. Based on the vector space model, a centroid-based method computes beforehand, for each category, an explicit profile (or class prototype), which is the centroid vector of all positive training documents of that category. The classification task is then to find the class most similar to the vector of the document to be classified, for example by means of cosine similarity. This type of classifier is easy to implement and computationally efficient. Several studies, including [10], [13], [18], [19], [21], [22], [23], showed that it achieves relatively high classification accuracy. In a centroid-based approach, each individual class is modeled by weighting the terms appearing in the training documents of that class. Classification performance therefore depends strongly on the weighting method applied. As part of the term-weighting calculation, normalization is an important factor in constructing a better representation of a class. Unlike instance-based methods such as k-NN, centroid-based methods can exploit class normalization as an additional factor to improve the representation of a class.
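As a rough illustration of the general scheme described above (a minimal sketch, not the authors' exact formulation; the helper names are ours), a centroid-based classifier over tf × idf vectors can be built by averaging the training vectors of each class and comparing a new vector against each class prototype with cosine similarity:

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(docs):
    """docs: list of token lists; returns one sparse tf*idf dict per document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency per term
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def centroids(vectors, labels):
    """Average the vectors of each class into a class prototype (centroid)."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for vec, lab in zip(vectors, labels):
        for t, w in vec.items():
            sums[lab][t] += w
    return {lab: {t: w / counts[lab] for t, w in s.items()}
            for lab, s in sums.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(vec, protos):
    """Assign the class whose centroid is most similar to the input vector."""
    return max(protos, key=lambda lab: cosine(vec, protos[lab]))
```

Both training (one pass to build centroids) and classification (one similarity per class) are linear in the data, which is the computational advantage noted above.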
In the past, most researchers applied document-length normalization to prevent long documents from gaining an advantage over short ones, and some studies considered a simple kind of class normalization, so-called class-length normalization, to address the problem. However, no intensive work has clarified how these normalizations affect classification performance or whether other useful normalizations exist. In this paper, a more systematic experiment is conducted to explore normalization approaches in centroid-based methods. The purpose of this paper is threefold: (1) to investigate the effectiveness of document- and class-length normalizations, (2) to evaluate a number of commonly used normalization functions, and (3) to introduce a new type of class normalization, called term-length normalization, which exploits term distribution among the documents in a class. Using various data sets, the performance is compared to that of the standard centroid-based classifier with tf × idf, the baseline. In the rest of this paper, Section 2 presents centroid-based text categorization. Section 3 describes the concept of normalization in text categorization as well as the proposed class-length and term-length normalizations. The data sets and experimental settings are described in Section 4. Section 5 presents experimental results on four data sets. Conclusions and future work are given in Section 6.
Centroid-based text categorization
Given a set of classes and a set of training documents where each training document dj is assigned to one or more classes, text categorization is the task of using this information to find one or more suitable categories for a new document. By this definition, each class ck is associated with the set of training documents assigned to that class. In a vector space model, a document (or a class) is represented by a vector based on a bag of words.
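In symbols, the standard centroid formulation consistent with the description above (notation ours) is: letting Dk denote the set of training documents assigned to class ck,

```latex
c_k = \frac{1}{|D_k|} \sum_{d_j \in D_k} d_j ,
\qquad
\hat{c}(d) = \operatorname*{arg\,max}_{k} \cos(d, c_k)
           = \operatorname*{arg\,max}_{k} \frac{d \cdot c_k}{\lVert d \rVert \, \lVert c_k \rVert}
```

where each dj is the weighted term vector of a training document and d is the vector of the document to be classified.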
Normalization approaches
From the normalization aspect, two different schemes can be taken into account. The document-oriented scheme focuses on normalizing a document (class) vector based on document (class) sizes, while the term-oriented scheme focuses on normalizing with respect to term sizes. Normalization based on document sizes, known as document-length normalization, has been reported in several works in the Information Retrieval (IR) field to be helpful in improving retrieval accuracy [25], [29].
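For concreteness, three commonly used length-normalization functions can be sketched as follows (the naming is ours; the exact set of functions evaluated in the paper may differ). Each rescales a sparse weight vector, for a document or a class, so that vector length no longer depends on raw size:

```python
import math

def cosine_norm(vec):
    # L2 (cosine) normalization: scale to unit Euclidean length.
    n = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / n for t, w in vec.items()} if n else dict(vec)

def sum_norm(vec):
    # L1 normalization: weights sum to one.
    n = sum(abs(w) for w in vec.values())
    return {t: w / n for t, w in vec.items()} if n else dict(vec)

def max_norm(vec):
    # Max normalization: the largest weight becomes one.
    n = max((abs(w) for w in vec.values()), default=0.0)
    return {t: w / n for t, w in vec.items()} if n else dict(vec)
```

Applied to document vectors before merging, these implement document-length normalization; applied to the merged centroid, they implement class-length normalization.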
Data sets and experimental settings
Four data sets are used in the experiments: (1) Drug Information (DI), (2) Newsgroups (News), (3) WebKB1 and (4) WebKB2. The first data set (DI) is a set of drug information web pages collected from www.rxlist.com. It includes 4480 English web pages with seven classes: adverse drug reaction, clinical pharmacology, description, indications, overdose, patient information, and warning. Each web page in this data set consists of informative content with a few links. Its structure is well organized.
Normalization timing
Two types of normalization timing, WNM (weight-normalize-merge) and WMN (weight-merge-normalize), are explored using the term weighting tf × idf with cosine normalization. The results are reported as an average classification accuracy (%) ± standard error of the mean (SEM). Table 5 shows that WMN outperforms WNM on the data sets with an unbalanced number of documents per class (WebKB1 and WebKB2) and is quite competitive on the data sets with a balanced number of documents per class (DI and News). When a class possesses a large number of documents, normalizing after merging appears to be more beneficial.
Conclusion and future work
In this paper, we illustrated how to utilize class normalization to improve the performance of a centroid-based classifier. To do this, we investigated the effectiveness of document- and class-length normalizations and then evaluated the effects of various normalization functions. We also proposed term-length normalization, which exploits term distribution among the documents in a class for term weighting. The experimental results indicated that weight–merge–normalize (a form of normalization timing) is generally preferable.
Acknowledgements
This work has been supported by the National Electronics and Computer Technology Center (NECTEC) under project number NT-B-22-I5-38-47-04. We would like to thank the reviewers for their valuable comments.
References
- Term-weighting approaches in automatic text retrieval, Information Processing and Management (1988)
- An information-theoretic perspective of tf-idf measures, Information Processing and Management (2003)
- Learning to construct knowledge bases from the World Wide Web, Artificial Intelligence (2000)
- Text classification from labeled and unlabeled documents using EM, Machine Learning (2000)
- An evaluation of statistical approaches to text categorization, Information Retrieval (1999)
- Automated learning of decision rules for text categorization, ACM Transactions on Information Systems (1994)
- Learning rules for large vocabulary word sense disambiguation
- An example-based mapping method for text categorization and retrieval, ACM Transactions on Information Systems (1994)
- A re-examination of text categorization methods
- Automatic essay grading using text categorization techniques
- Text categorization using weight-adjusted k-nearest neighbor classification
- Improving text retrieval for the routing problem using latent semantic indexing
- A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization
- A fast algorithm for hierarchical text classification
- A linear text classification algorithm based on category relevance factors
- Text categorization with support vector machines: learning with many relevant features
- Feature selection, perceptron learning, and a usability case study for text categorization
1. Tel.: +66 0 2501 3505; fax: +66 0 2501 3524.