Elsevier

Expert Systems with Applications

Volume 100, 15 June 2018, Pages 157-164

Hierarchy construction and text classification based on the relaxation strategy and least information model

https://doi.org/10.1016/j.eswa.2018.02.003

Highlights

  • Hierarchical classification is an effective approach to categorize large-scale text data.

  • The relaxation strategy effectively alleviates the impact of the ‘blocking’ problem.

  • A new term weighting approach based on the Least Information Theory is proposed.

  • It offers a new model that quantifies information in changes between probability distributions.

Abstract

Hierarchical classification is an effective approach to categorization of large-scale text data. We introduce a relaxation strategy into the traditional hierarchical classification method to improve system performance. During hierarchy construction, our method delays the judgment of an uncertain category until it can be classified clearly. This effectively alleviates the ‘blocking’ problem, in which classification errors at higher levels of the hierarchy propagate to lower levels. A new term weighting approach based on the Least Information Theory (LIT) is adopted for hierarchical classification. It quantifies the information in probability distribution changes and offers a new document representation model in which the contribution of each term can be properly weighted. The experimental results show that the relaxation approach builds a more reasonable hierarchy and further improves classification performance. The approach also outperforms other classification methods such as SVM (Support Vector Machine) in efficiency, making it better suited to large-scale text classification tasks. Compared to the classic term weighting method TF*IDF, LIT-based methods achieve significant improvements in classification performance.

Introduction

The task of text classification is to assign a predefined category to a free text document. With more and more textual information available online, hierarchical organization of text documents is becoming increasingly important for managing the data. Research on automatically classifying documents into the categories of such a hierarchy is therefore needed.

Most classifiers make their decisions in a single flat category space, and classification performance degrades quickly as data sets and category sets grow, especially in terms of classification time. A hierarchical classification method, by contrast, organizes all of the categories into a tree-like structure and trains a classifier at each node in the hierarchy. The classification process proceeds from the root of the tree until it reaches a leaf node, which denotes the final category for the document.

The hierarchies are mostly represented as binary trees. During hierarchical classification, the document to be classified starts from the root, and the next direction is determined by each node classifier; the leaf that is finally reached determines the category label. However, there is a ‘blocking’ problem in this process: an error made by an upper node classifier cannot be corrected by the lower node classifiers. The ‘blocking’ problem may therefore result in weaker performance compared to non-hierarchical classification methods. The advantage of hierarchical classification is its higher efficiency, which is significant on large-scale data sets.
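The top-down routing described above can be sketched as follows. The tiny tree and the lambda "classifiers" are illustrative placeholders, not the paper's actual models; the sketch only shows why a wrong branch at the root can never be corrected below it:

```python
# Minimal sketch of top-down hierarchical classification over a binary
# category tree. Node names and the classifier interface are illustrative.

class Node:
    def __init__(self, classifier=None, left=None, right=None, label=None):
        self.classifier = classifier  # binary classifier at internal nodes
        self.left = left              # subtree for one category subset
        self.right = right            # subtree for the other subset
        self.label = label            # category label at leaf nodes

def classify(node, doc):
    """Route a document from the root down to a leaf category."""
    while node.label is None:
        # Each internal classifier picks a branch; an error made here
        # cannot be corrected lower down -- the 'blocking' problem.
        if node.classifier(doc) == 0:
            node = node.left
        else:
            node = node.right
    return node.label

# Toy tree: the root separates {A} from {B, C}.
tree = Node(
    classifier=lambda d: 0 if "a" in d else 1,
    left=Node(label="A"),
    right=Node(
        classifier=lambda d: 0 if "b" in d else 1,
        left=Node(label="B"),
        right=Node(label="C"),
    ),
)
print(classify(tree, {"a"}))  # A
print(classify(tree, {"b"}))  # B
```

Only one classifier is evaluated per level, which is the source of the efficiency advantage: classification cost grows with the tree depth rather than with the total number of categories.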

In order to improve hierarchical classification performance, we introduce the idea of a relaxation strategy into the hierarchy construction process and further propose a hierarchical classification approach based on it. The method delays an uncertain category decision until the category can be classified definitely, thereby alleviating the impact of the ‘blocking’ problem. We evaluate the approach on the Reuters Corpus Volume 1 (RCV1). The results show that our method builds a more rational category hierarchy and improves the performance of traditional hierarchical classification. In particular, the approach is more time-efficient than classifiers such as the Support Vector Machine.

Another contribution of this work is in term weighting and document representation. The classic TF*IDF has been widely used for term weighting and document representation in text clustering and classification tasks (Liu et al., 2003, Yang and Pedersen, 1997). Least Information Theory (LIT) extends Shannon's information theory to accommodate a non-linear relation between information and uncertainty and offers a new way of modeling term weighting and document representation (Ke, 2015). It establishes a new basic information quantity and provides insight into how terms can be weighted based on their probability distributions in documents vs. in the collection. We adopt LIT for term weighting during hierarchical classification, and it achieves significant performance improvement over the classic TF*IDF.
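For reference, the classic TF*IDF baseline mentioned above can be computed in a few lines, here in its common tf × log(N/df) form (the toy corpus is illustrative; real systems typically add smoothing and length normalization):

```python
# Compact TF*IDF weighting over a tiny tokenized corpus,
# using the plain tf * log(N / df) formulation.
import math
from collections import Counter

docs = [
    ["hierarchy", "text", "classification"],
    ["text", "clustering"],
    ["information", "theory", "text"],
]

N = len(docs)
df = Counter()                 # document frequency of each term
for d in docs:
    df.update(set(d))

def tfidf(doc):
    tf = Counter(doc)          # term frequency within the document
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

weights = tfidf(docs[0])
# "text" appears in every document, so its IDF (and weight) is 0.
print(weights["text"])  # 0.0
```

Terms that occur in every document get zero weight, while terms concentrated in few documents are boosted; LIT replaces this heuristic with a weight derived from an information quantity.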

Section snippets

Related work

It is important to build a rational category hierarchy, and there are two common ways to do so: the Top-Down and Bottom-Up approaches. Liu, Yi, and Chia (2005) present a method that builds a hierarchical structure from the training dataset and uses the K-Means clustering algorithm to divide the category set. The hierarchical structure of the resulting SVM classification tree manifests the interclass relationships among the different classes.

Chen, Crawford, and Ghosh (2004) propose the

Relaxation method

To build a hierarchical structure containing n categories, the category set is divided recursively into two subsets. The K-Means clustering algorithm is adopted to obtain two clusters on the text data set, which helps determine the node to which each category belongs.
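The recursive splitting can be sketched as follows. Here each category is summarized by a single one-dimensional centroid purely for illustration (the paper clusters the actual text data), and the 2-means routine is a minimal stand-in for a full K-Means implementation:

```python
# Hedged sketch: recursively divide a category set into a binary tree
# by 2-means clustering over illustrative per-category centroid values.
import random

def two_means(points, iters=20, seed=0):
    """Cluster 1-D points into two groups; return two index lists."""
    rng = random.Random(seed)
    c0, c1 = rng.sample(points, 2)          # initial centers
    g0, g1 = [], []
    for _ in range(iters):
        g0 = [i for i, p in enumerate(points) if abs(p - c0) <= abs(p - c1)]
        g1 = [i for i in range(len(points)) if i not in g0]
        if g0:
            c0 = sum(points[i] for i in g0) / len(g0)
        if g1:
            c1 = sum(points[i] for i in g1) / len(g1)
    return g0, g1

def build_hierarchy(categories, centroids):
    """Recursively split the category set into a binary tree of tuples."""
    if len(categories) == 1:
        return categories[0]
    g0, g1 = two_means(centroids)
    if not g0 or not g1:                    # degenerate split: stop recursing
        return tuple(categories)
    left = build_hierarchy([categories[i] for i in g0],
                           [centroids[i] for i in g0])
    right = build_hierarchy([categories[i] for i in g1],
                            [centroids[i] for i in g1])
    return (left, right)

# Toy centroids: A and B are close together, C is far away,
# so the first split should separate C from {A, B}.
print(build_hierarchy(["A", "B", "C"], [0.1, 0.2, 5.0]))
```

The relaxation strategy enters exactly at this step: a category whose documents are split ambiguously between the two clusters has its placement delayed rather than being forced into one subset.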

As shown in Fig. 1, the aim is to divide the root node S, which contains categories A, B, and C, into two subsets referred to as SL and SR, respectively.

As an example, there are 30 training documents A01,…, A10, B01,…, B10, C01,…,

Least information model for term weighting

The Least Information Theory (LIT) measures the distance between two probability distributions in a way different from Kullback–Leibler (KL) divergence (Lin, 1991). It establishes a new basic information quantity and provides insight into how terms can be weighted based on their probability distributions in documents vs. in the collection.
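For context, the KL divergence that LIT is contrasted with can be computed directly. The two distributions below are illustrative only; this is the baseline distance measure, not the LIT quantity itself:

```python
# Kullback-Leibler divergence between two term probability distributions,
# e.g. a term's distribution in a document vs. in the whole collection.
import math

def kl_divergence(p, q):
    """D(p || q) = sum_i p_i * log(p_i / q_i); requires q_i > 0 where p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative distributions over three terms.
doc = [0.7, 0.2, 0.1]
collection = [0.3, 0.4, 0.3]

print(kl_divergence(doc, collection))   # > 0: the document deviates
print(kl_divergence(doc, doc))          # 0.0: identical distributions
```

Note that KL is asymmetric (D(p||q) ≠ D(q||p) in general) and undefined when q assigns zero probability where p does not, which is part of the motivation for alternative information quantities such as LIT.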

We apply the proposed Least Information Theory (Ke, 2015) to term weighting and document representation. A text document can be viewed as a set of terms with

Data sets

We conducted the experiments on the Reuters Corpus Volume 1 (RCV1-v2) data set. The collection contains 804,414 newswire stories made available by Reuters. RCV1-v2 is a corrected version of the original collection, in which documents were manually assigned to a hierarchy of 103 categories. We select 10, 15, 20, 25, and 30 categories respectively to build data sets of different sizes, labeled RCV1_10, RCV1_15, RCV1_20, RCV1_25, and RCV1_30. There are 500 documents selected randomly for each

Conclusions

We propose a hierarchical classification approach based on the relaxation strategy, which alleviates the impact of the ‘blocking’ problem. It delays an uncertain category decision until the category can be classified definitely, so that an error occurring at an upper level is not transferred to the lower levels. We also apply the Least Information Theory to term weighting and document representation, and it offers a new basic information quantity model based on different probability

Acknowledgment

This work is supported by the National Science and Technology Support Plan (No. 2013BAH21B02-01), the National Natural Science Foundation of China (Nos. 61375059, 61672065), and the Beijing Natural Science Foundation (No. 4153058).

References

  • K. Chen et al. Turning from TF-IDF to TF-IGM for term weighting in text classification. Expert Systems with Applications (2016).

  • M. Pavlinek et al. Text classification method based on self-training and LDA topic models. Expert Systems with Applications (2017).

  • L. Zhang et al. Hierarchical multi-label classification using fully associative ensemble learning. Pattern Recognition (2017).

  • G. Amati et al. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems (2002).

  • A. Aizawa. The feature quantity: An information theoretic perspective of tfidf-like measures.

  • S. Bengio et al. Label embedding trees for large multi-class tasks.

  • Y.C. Chen et al. Integrating support vector machines in a hierarchical output space decomposition framework.

  • A. Deepak et al. Variable global feature selection scheme for automatic classification of text documents. Expert Systems with Applications (2017).

  • J. Deng et al. Fast and balanced: Efficient label tree learning for large scale object recognition. Advances in Neural Information Processing Systems (2011).

  • T. Gao et al. Discriminative learning of relaxed hierarchy for large-scale visual recognition.

  • G. Griffin et al. Learning and using taxonomies for fast visual categorization.

  • X.M. Gong et al. Term weighting for interactive cluster labeling based on least information gain.