Elsevier

Information Sciences

Volume 176, Issue 12, 22 June 2006, Pages 1712-1738
Class normalization in centroid-based text categorization

https://doi.org/10.1016/j.ins.2005.05.010

Abstract

Centroid-based categorization is one of the most popular algorithms in text classification. In this approach, normalization is an important factor for improving the performance of a centroid-based classifier when documents in the text collection have quite different sizes and/or the numbers of documents per class are unbalanced. In the past, most researchers applied document normalization, e.g., document-length normalization, while some considered a simple kind of class normalization, so-called class-length normalization, to solve the imbalance problem. However, no intensive work has clarified how these normalizations affect classification performance or whether there are other useful normalizations. The purpose of this paper is threefold: (1) to investigate the effectiveness of document- and class-length normalizations on several data sets, (2) to evaluate a number of commonly used normalization functions and (3) to introduce a new type of class normalization, called term-length normalization, which exploits term distribution among documents in the class. The experimental results show that a classifier with the weight–merge–normalize approach (class-length normalization) performs better than one with the weight–normalize–merge approach (document-length normalization) on data sets with unbalanced numbers of documents per class, and is quite competitive on those with balanced numbers. Among normalization functions, normalization based on term weighting performs best on average. Term-length normalization is also useful for improving classification accuracy. The combination of term- and class-length normalizations outperforms pure class-length normalization, pure term-length normalization and no normalization by 4.29%, 11.50% and 30.09%, respectively.

Introduction

In the past, several text categorization models were developed under different schemes, such as probabilistic models [1], [2], decision trees and rules [3], [4], regression models [5], [6], example-based models (e.g., k-nearest neighbor or k-NN) [6], [7], [8], [9], linear models [10], [11], [12], [13], [14], support vector machines [6], [15], neural networks [16], [17], [18] and so on. Among these models, a variant of the linear models called the centroid-based method is attractive since it requires less computation than other methods in both the learning and classification stages. The traditional centroid-based method [19] can be viewed as a specialization of the so-called Rocchio method [20] and has been used in several works on text categorization [11], [12], [21], [22]. Based on the vector space model, a centroid-based method computes beforehand, for each category, an explicit profile (or class prototype), which is a centroid vector of all positive training documents of that category. The classification task is then to find the class most similar to the vector of the document to be classified, for example by means of cosine similarity. This type of classifier is easy to implement and computationally efficient. Several studies, including those in [10], [13], [18], [19], [21], [22], [23], showed that it achieves relatively high classification accuracy. In a centroid-based approach, each individual class is modeled by weighting the terms appearing in the training documents of that class. This makes the classification performance of the model depend strongly on the weighting method applied. As part of the term-weighting calculation, normalization is an important factor in constructing a better representation of a class. Unlike instance-based methods such as k-NN, centroid-based methods can utilize class normalization as an additional factor to improve the representation of a class.
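The centroid-based scheme described above can be sketched in a few lines. This is our own illustration, not the authors' implementation; the function names are hypothetical, and the document vectors are assumed to be pre-weighted (e.g., by tf × idf):

```python
import numpy as np

def train_centroids(doc_vectors, labels, num_classes):
    """Build one centroid (class prototype) per class by averaging
    the weighted vectors of that class's training documents."""
    dim = doc_vectors.shape[1]
    centroids = np.zeros((num_classes, dim))
    for c in range(num_classes):
        members = doc_vectors[labels == c]
        centroids[c] = members.mean(axis=0)
    return centroids

def classify(doc_vector, centroids):
    """Assign the class whose centroid has the highest cosine
    similarity with the document vector."""
    dots = centroids @ doc_vector
    norms = np.linalg.norm(centroids, axis=1) * np.linalg.norm(doc_vector)
    return int(np.argmax(dots / norms))
```

Both training (one pass to average per class) and classification (one similarity per class) are linear in the data, which is the computational advantage noted above.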
In the past, most researchers applied document-length normalization to prevent a long document from gaining an advantage over a short one, and some considered a simple kind of class normalization, so-called class-length normalization, to solve the imbalance problem. However, no intensive work has clarified how these normalizations affect classification performance or whether there are other useful normalizations. In this paper, a more systematic experiment is made to explore normalization approaches in centroid-based methods. The purpose of this paper is threefold: (1) to investigate the effectiveness of document- and class-length normalizations, (2) to evaluate a number of commonly used normalization functions and (3) to introduce a new type of class normalization, called term-length normalization, which exploits term distribution among documents in the class. Using various data sets, the performance is compared to that of the standard centroid-based classifier with tf × idf as the baseline. In the rest of this paper, Section 2 presents centroid-based text categorization. Section 3 describes the concept of normalization in text categorization as well as the proposed class-length and term-length normalizations. The data sets and experimental settings are described in Section 4. Section 5 presents experimental results on the four data sets. Conclusions and future work are given in Section 6.

Section snippets

Centroid-based text categorization

Given a set of classes C = {c1, c2, …, c|C|} and a set of training documents D = {d1, d2, …, d|D|}, where each training document dj is assigned to one or more classes, text categorization is the task of using this given information to find one or more suitable categories for a new document. By this definition, a class ck is given the set of training documents assigned to that class (Ck = {dj | dj belongs to the class ck}). In a vector space model, a document (or a class) is represented by a vector based on a bag of
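For concreteness, a minimal tf × idf weighting sketch over tokenized documents follows. This is our own illustration using one common idf variant, idf(t) = log(|D| / df(t)); the weighting schemes evaluated in the paper may differ:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute tf x idf weights for a list of tokenized documents.
    tf is the raw term frequency in a document; df(t) is the number
    of documents containing term t."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    vocab = sorted(df)
    idf = {t: math.log(n / df[t]) for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors
```

A term that occurs in every document gets idf = 0 under this variant, so it contributes nothing to any document (or centroid) vector.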

Normalization approaches

From the normalization aspect, two different schemes can be considered. The document-oriented scheme focuses on normalizing a document (class) vector based on document (class) sizes, while the term-oriented scheme focuses on normalizing with respect to term sizes. Normalization based on document sizes, known as document-length normalization, was reported by several works in the Information Retrieval (IR) field to be helpful in improving retrieval accuracy [25], [29],
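Two normalization functions that commonly appear in this setting are L2 (cosine) and L1 (sum) normalization. The sketch below is our own illustration of both, independent of the paper's notation:

```python
import numpy as np

def cosine_normalize(vec):
    """L2 (cosine) normalization: divide by the Euclidean length so
    the normalized vector has unit length; long and short documents
    then contribute on the same scale."""
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def sum_normalize(vec):
    """L1 (sum) normalization: divide by the sum of absolute weights
    so the component magnitudes sum to one."""
    total = np.abs(vec).sum()
    return vec / total if total > 0 else vec
```

Either function can be applied to a document vector (document-length normalization) or to a merged class vector (class-length normalization); the timing of that choice is examined below.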

Data sets and experimental settings

Four data sets are used in the experiments: (1) Drug Information (DI), (2) Newsgroups (News), (3) WebKB1 and (4) WebKB2. The first data set (DI) is a set of drug information web pages collected from www.rxlist.com. It includes 4480 English web pages with seven classes: adverse drug reaction, clinical pharmacology, description, indications, overdose, patient information, and warning. Each web page in this data set consists of informative content with a few links. Its structure is well organized.

Normalization timing

Two types of normalization timing, WNM (weight–normalize–merge) and WMN (weight–merge–normalize), are explored using tf × idf term weighting with cosine normalization. The results are reported as an average classification accuracy (%) ± standard error of the mean (SEM). Table 5 shows that WMN outperforms WNM on the data sets with unbalanced numbers of documents per class (WebKB1 and WebKB2) and is quite competitive on the data sets with balanced numbers of documents per class (DI and News). When a class possesses a large number of
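The two timings can be contrasted directly. In this sketch (our own, assuming plain averaging as the merge step and cosine normalization), WNM normalizes each document vector before merging, while WMN merges first and normalizes the resulting class vector:

```python
import numpy as np

def centroid_wnm(doc_vectors):
    """Weight-Normalize-Merge: cosine-normalize each weighted document
    vector first, then average (document-length normalization)."""
    normed = [v / np.linalg.norm(v) for v in doc_vectors]
    return np.mean(normed, axis=0)

def centroid_wmn(doc_vectors):
    """Weight-Merge-Normalize: average the raw weighted vectors first,
    then cosine-normalize the result (class-length normalization)."""
    merged = np.mean(doc_vectors, axis=0)
    return merged / np.linalg.norm(merged)
```

With WNM every document contributes a unit-length vector regardless of its size, whereas with WMN long documents dominate the merged vector before the single final normalization, which is why the two orderings behave differently when document lengths or class sizes are skewed.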

Conclusion and future work

In this paper, we illustrated how to utilize class normalization to improve the performance of a centroid-based classifier. To do this, we investigated the effectiveness of document- and class-length normalizations and then evaluated the effects of various normalization functions. We also proposed term-length normalization, which exploits term distribution among documents in a class for term weighting. The experimental results indicated that weight–merge–normalize (a form of

Acknowledgements

This work has been supported by the National Electronics and Computer Technology Center (NECTEC) under project number NT-B-22-I5-38-47-04. We would like to thank the reviewers for their valuable comments.

References (41)

  • G. Salton et al., Term-weighting approaches in automatic text retrieval, Information Processing and Management (1988)
  • A.N. Aizawa, An information-theoretic perspective of tf–idf measures, Information Processing and Management (2003)
  • M. Craven et al., Learning to construct knowledge bases from the World Wide Web, Artificial Intelligence (2000)
  • K. Nigam et al., Text classification from labeled and unlabeled documents using EM, Machine Learning (2000)
  • Y. Yang, An evaluation of statistical approaches to text categorization, Information Retrieval (1999)
  • C.D. Apté et al., Automated learning of decision rules for text categorization, ACM Transactions on Information Systems (1994)
  • G. Paliouras et al., Learning rules for large vocabulary word sense disambiguation
  • Y. Yang et al., An example-based mapping method for text categorization and retrieval, ACM Transactions on Information Systems (1994)
  • Y. Yang et al., A re-examination of text categorization methods
  • L.S. Larkey, Automatic essay grading using text categorization techniques
  • D.B. Skalak, Prototype and feature selection by sampling and random mutation hill climbing algorithms, in: ...
  • E.-H. Han et al., Text categorization using weight-adjusted k-nearest neighbor classification
  • D.A. Hull, Improving text retrieval for the routing problem using latent semantic indexing
  • D.J. Ittner, D.D. Lewis, D.D. Ahn, Text categorization of low quality images, in: Proceedings of SDAIR-95, 4th Annual ...
  • T. Joachims, A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization
  • W.T. Chuang et al., A fast algorithm for hierarchical text classification
  • Z.-H. Deng et al., A linear text classification algorithm based on category relevance factors
  • T. Joachims, Text categorization with support vector machines: learning with many relevant features
  • H.T. Ng et al., Feature selection, perceptron learning, and a usability case study for text categorization
  • P. Koehn, Combining multiclass maximum entropy text classifiers with neural network voting, in: E. Ranchod, N.J. Mamede ...