Hierarchy construction and text classification based on the relaxation strategy and least information model
Introduction
The task of text classification is to assign a predefined category to a free text document. With more and more textual information available online, hierarchical organization of text documents is becoming increasingly important for managing the data, and research on automatically classifying documents into the categories of such a hierarchy is needed.
Most classifiers make their decisions in a single flat category space. Classification performance degrades quickly on larger data sets with more categories, especially in terms of classification time. A hierarchical classification method, by contrast, organizes all of the categories into a tree-like structure and trains a classifier at each node of the hierarchy. The classification process starts from the root of the tree and proceeds until it reaches a leaf node, which denotes the final category for the document.
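The top-down decision process described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `Node` and the per-node binary classifier interface are assumed names.

```python
# Minimal sketch of top-down hierarchical classification: each internal
# node holds a binary classifier that routes a document left or right
# until a leaf (a final category label) is reached.

class Node:
    def __init__(self, label=None, clf=None, left=None, right=None):
        self.label = label  # category label at a leaf, otherwise None
        self.clf = clf      # binary classifier at an internal node
        self.left = left
        self.right = right

def classify(root, doc_vector):
    """Route a document vector from the root down to a leaf category."""
    node = root
    while node.label is None:
        # the node classifier returns 0 (go left) or 1 (go right)
        node = node.left if node.clf.predict(doc_vector) == 0 else node.right
    return node.label
```

Because only one classifier is consulted per level, classification cost grows with the depth of the tree (roughly logarithmic in the number of categories) rather than with the total number of categories, which is the efficiency advantage noted below.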
Hierarchies are mostly represented as binary trees. During hierarchical classification, the document to be classified starts from the root, and the next direction is determined by each node classifier; the leaf finally reached supplies the category label. However, a ‘blocking’ problem arises during this process: an error made by an upper node classifier cannot be corrected by a lower node classifier. The ‘blocking’ problem may result in weaker performance compared to non-hierarchical classification. The advantage of the hierarchical method is its higher efficiency, which is significant on large-scale data sets.
In order to improve hierarchical classification performance, we introduce a relaxation strategy into the hierarchy construction process and propose a hierarchical classification approach based on it. The method delays an uncertain category decision until the document can be classified definitely, thereby alleviating the impact of the ‘blocking’ problem. We evaluate the method on Reuters Corpus Volume 1 (RCV1). The results show that our method builds a more rational category hierarchy and improves the performance of traditional hierarchical classification. In particular, the approach has higher time efficiency than other classifiers such as the Support Vector Machine.
Another contribution of this work concerns term weighting and document representation. The classic TF*IDF scheme has been widely used for term weighting and document representation in text clustering and classification tasks (Liu et al., 2003, Yang and Pedersen, 1997). Least Information Theory (LIT) extends Shannon's information theory to accommodate a non-linear relation between information and uncertainty and offers a new way of modeling term weighting and document representation (Ke, 2015). It establishes a new basic information quantity and provides insight into how terms can be weighted based on their probability distributions in documents vs. in the collection. We adopt LIT for term weighting during hierarchical classification and achieve a significant performance improvement over classic TF*IDF.
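For reference, the classic TF*IDF baseline mentioned above can be sketched as follows. This is a minimal textbook variant (raw term frequency, logarithmic inverse document frequency); tokenization is assumed to have been done already, and it is not the paper's exact formulation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute classic TF*IDF weights for a list of tokenized documents.

    tf  = raw term count in the document
    idf = log(N / df), where df is the number of documents containing the term
    Returns one {term: weight} dict per document.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors
```

Note that a term appearing in every document gets idf = log(1) = 0 and thus carries no weight, which is the behavior LIT-based weighting is designed to refine.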
Section snippets
Related work
It is important to build a rational category hierarchy, and there are two common ways to do so: Top-Down and Bottom-Up approaches. Liu, Yi, and Chia (2005) present a method that builds a hierarchical structure from the training data set and uses the K-Means clustering algorithm to divide the category set. The hierarchical structure of the resulting SVM classification tree manifests the interclass relationships among different classes.
Chen, Crawford, and Ghosh (2004) propose the
Relaxation method
To build a hierarchical structure over n categories, the category set is divided into two subsets recursively. The K-Means clustering algorithm is applied to the text data set to obtain two clusters, which help determine the node to which each category belongs.
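One bisection step can be sketched as below. This is a simplified illustration under the assumption that each category is summarized by a centroid vector; the paper clusters the training documents themselves, and the function and parameter names here are illustrative.

```python
import numpy as np

def bisect_categories(centroids, labels, iters=20, seed=0):
    """Split a set of category centroids into two groups with 2-means.

    centroids: (n_categories, dim) array, one summary vector per category
    labels:    list of category names aligned with `centroids`
    Returns (left_labels, right_labels). Applied recursively to each
    subset, this yields a binary category hierarchy.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(centroids), size=2, replace=False)
    centers = centroids[idx].astype(float)
    for _ in range(iters):
        # assign each category centroid to the nearer cluster center
        d = np.linalg.norm(centroids[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # recompute each center as the mean of its assigned centroids
        for k in (0, 1):
            if (assign == k).any():
                centers[k] = centroids[assign == k].mean(axis=0)
    left = [l for l, a in zip(labels, assign) if a == 0]
    right = [l for l, a in zip(labels, assign) if a == 1]
    return left, right
```

Recursing on each returned subset until every subset holds a single category produces the binary tree on which the per-node classifiers are trained.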
As shown in Fig. 1, the aim is to divide the root node S, which contains categories A, B, and C, into two subsets referred to as SL and SR, respectively.
As an example, there are 30 training documents A01,…, A10, B01,…, B10, C01,…,
Least information model for term weighting
The Least Information Theory (LIT) measures the distance between two probability distributions in a way different from Kullback–Leibler (KL) divergence (Lin, 1991). It establishes a new basic information quantity and provides insight into how terms can be weighted based on their probability distributions in documents vs. in the collection.
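The LIT quantity itself is developed in (Ke, 2015) and is not reproduced here; for contrast, the KL divergence that LIT departs from can be computed as follows. This is a minimal sketch over two aligned discrete distributions, with a small epsilon as an assumed smoothing choice to avoid division by zero.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) between two discrete
    probability distributions given as aligned lists of probabilities.

    Zero-probability events in p contribute nothing; eps guards against
    division by zero when q assigns zero probability.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )
```

KL divergence is asymmetric and unbounded when q assigns near-zero probability to events p supports, which is one motivation for alternative distribution-distance measures such as LIT in term weighting.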
We apply the proposed Least Information Theory (Ke, 2015) to term weighting and document representation. A text document can be viewed as a set of terms with
Data sets
We conducted the experiments on the Reuters Corpus Volume 1 (RCV1-v2) data set. The collection contains 804,414 newswire stories made available by Reuters. RCV1-v2 is a corrected version of the original collection, in which documents were manually assigned to a hierarchy of 103 categories. We select 10, 15, 20, 25 and 30 categories respectively to build the different size data sets, labeled as RCV1_10, RCV1_15, RCV1_20, RCV1_25 and RCV1_30. There are 500 documents selected randomly for each
Conclusions
We propose a hierarchical classification approach based on the relaxation strategy, which alleviates the impact of the ‘blocking’ problem. It delays an uncertain category decision until the document can be classified definitely, so that an error made at an upper level is not transferred to a lower level. We also apply Least Information Theory to term weighting and document representation, which offers a new basic information quantity model based on different probability
Acknowledgment
This work is supported by the National Science and Technology Support Plan (No. 2013BAH21B02-01), the National Natural Science Foundation of China (Nos. 61375059, 61672065), and the Beijing Natural Science Foundation (No. 4153058).
References (25)
- et al. Turning from TF-IDF to TF-IGM for term weighting in text classification. Expert Systems with Applications (2016)
- et al. Text classification method based on self-training and LDA topic models. Expert Systems with Applications (2017)
- et al. Hierarchical multi-label classification using fully associative ensemble learning. Pattern Recognition (2017)
- et al. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems (2002)
- The feature quantity: An information theoretic perspective of tfidf-like measures
- et al. Label embedding trees for large multi-class tasks
- et al. Integrating support vector machines in a hierarchical output space decomposition framework
- et al. Variable global feature selection scheme for automatic classification of text documents. Expert Systems with Applications (2017)
- et al. Fast and balanced: Efficient label tree learning for large scale object recognition. Advances in Neural Information Processing Systems (2011)
- et al. Discriminative learning of relaxed hierarchy for large-scale visual recognition