
Computers in Human Behavior

Volume 30, January 2014, Pages 689-697

Text classification using a few labeled examples

https://doi.org/10.1016/j.chb.2013.07.043

Highlights

  • A graph of terms can be effectively used for text classification.

  • Such a graph is extracted from documents via an LDA-based methodology.

  • The proposed method achieves good performance on standard datasets.

  • The approach can discover solutions matching user information needs.

Abstract

Supervised text classifiers need to learn from many labeled examples to achieve high accuracy. In a real context, however, sufficient labeled examples are not always available, because human labeling is enormously time-consuming. For this reason, there has been recent interest in methods that can obtain high accuracy when the size of the training set is small.

In this paper we introduce a new single-label text classification method that performs better than baseline methods when the number of labeled examples is small. Unlike most existing methods, which use a vector of features composed of weighted words, the proposed approach uses a structured vector of features composed of weighted pairs of words.

The proposed vector of features is automatically learned from a set of documents using a global method for term extraction based on Latent Dirichlet Allocation, implemented as a Probabilistic Topic Model. Experiments performed using a small percentage of the original training set (about 1%) confirm the discriminative power of the proposed features.

Introduction

With the proliferation of blogs, social networks and e-commerce sites, there is great interest in supervised and semi-supervised text classification methods to reveal user sentiments and opinions (Eirinaki and Pisal, 2012, Palus et al., 2011), to discover and classify health service information obtained from digital health ecosystems (Dong and Hussain, 2011, Karavasilis et al., 2010), and to classify web resources to improve the quality of web searches (Iqbal, 2012, Grzywaczewski, 2012, Liu, 2006, Colace et al., 2013).

The problem of supervised text classification has been extensively discussed in the literature, and performance measures confirm that the existing techniques achieve high accuracy when trained on large datasets (Manning et al., 2008, Sebastiani, 2002, Lewis et al., 2004).

However, a supervised classifier is often unfeasible in a real context, because a large labeled training set is not always available. It has been estimated that a human takes about 90 min to label 100 documents, which makes the labeling task practically unfeasible for large datasets (McCallum et al., 1999, Ko and Seo, 2009).

Furthermore, the accuracy of classifiers learned from a reduced training set (for instance, hundreds instead of thousands of labeled examples) is quite low, around 30% (Ko & Seo, 2009). This low accuracy stems from the fact that most existing methods use a vector of features composed of weighted words, obtained under the “bag of words” assumption (Manning et al., 2008). Due to the inherent ambiguity of language (polysemy, etc.), vectors of weighted words are insufficiently discriminative, especially when the classifier must learn common patterns from a few labeled examples described by numerous features (Clarizia et al., 2011, Napoletano et al., 2012).
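To make the weakness of the weighted-word baseline concrete, here is a minimal bag-of-words sketch, assuming scikit-learn and a hypothetical four-document training set (none of this is the authors' code):

```python
# Minimal bag-of-words baseline sketch (not the authors' implementation).
# With only a handful of labeled documents, the vocabulary |T| easily
# exceeds the number of training samples, which is the weakness the
# paper attributes to plain weighted-word features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical tiny training set (placeholder texts and labels).
train_docs = [
    "crude oil prices rose sharply in early trading",
    "opec agreed to cut crude output next quarter",
    "wheat and corn exports fell after the drought",
    "grain shipments resumed at the gulf ports",
]
train_labels = ["crude", "crude", "grain", "grain"]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)

# The vocabulary size already exceeds the 4 training documents here,
# illustrating the regime where |T| is larger than the sample count.
print(X_train.shape)

clf = LinearSVC().fit(X_train, train_labels)
print(clf.predict(vectorizer.transform(["oil output rose"])))
```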

In this paper we demonstrate that a more complex vector of features, based on weighted pairs of words, is capable of overcoming the limitations of simple structures when the number of labeled samples is small. Specifically, we propose a linear single label supervised classifier that is capable, based on a vector of features composed of weighted pairs of words, of achieving a better performance, in terms of accuracy, than existing methods when the size of the training set is about 1% of the original and composed of only positive examples. The proposed vector of features is automatically extracted from a set of documents D using a global method for term extraction based on the Latent Dirichlet Allocation (Blei, Ng, & Jordan, 2003) implemented as the Probabilistic Topic Model (Griffiths, Steyvers, & Tenenbaum, 2007).

To confirm the discriminative property of the proposed features, we have evaluated the performance through a comparison with different methods that use vectors of weighted words. The results, obtained on the top 10 classes of the ModApte split of the Reuters-21578 dataset, show that our method achieves better performance independently of the topic.

Section snippets

Background

Following the definition introduced in Sebastiani (2002), a supervised text classifier may be formalized as the task of approximating the unknown target function Φ : D × C → {T, F} (namely, the expert) by means of a function Φ̂ : D × C → {T, F}, called the classifier, where C = {c1, …, c|C|} is a predefined set of categories and D is a set of documents. If Φ(dj, ci) = T, then dj is called a positive example (or a member) of ci, while if Φ(dj, ci) = F it is called a negative example of ci.

The categories are just symbolic
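To make the formalization concrete, here is a tiny Python sketch; the names and the toy decision rule are illustrative only, not from the paper:

```python
# Sketch of the formal setting: the unknown target function Phi maps
# (document, category) pairs to True/False, and a classifier Phi_hat
# approximates it. All names here are illustrative.
from typing import Callable

Document = str
Category = str

# Phi_hat : D x C -> {T, F}
PhiHat = Callable[[Document, Category], bool]

def make_classifier(positive_terms: dict[str, set[str]]) -> PhiHat:
    """Toy classifier: d is a positive example of c if it mentions
    any hand-picked term for that category (purely illustrative)."""
    def phi_hat(d: Document, c: Category) -> bool:
        return any(term in d.lower() for term in positive_terms.get(c, set()))
    return phi_hat

phi_hat = make_classifier({"crude": {"oil", "barrel"}, "grain": {"wheat", "corn"}})
print(phi_hat("Oil prices rose", "crude"))  # True -> positive example of 'crude'
```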

Document representation and dimension reduction

Texts cannot be directly interpreted by a classifier, therefore an indexing procedure that maps a text dj into a compact representation of its content must be uniformly applied to the training and test documents. For the sake of simplicity we consider the case of the training set, but the procedure described here is repeated for the test set.

Each document can be represented, following the Vector Space Model (Manning et al., 2008), as a vector of term weights dj = {w1j, …, w|T|j}, where T is
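The snippet is truncated before the weighting scheme is specified; a common choice for the weights wkj is tf-idf, sketched below from first principles (assuming that scheme):

```python
# Sketch of computing the term weights w_kj of the Vector Space Model
# with the standard tf-idf scheme (one common choice; the paper's exact
# weighting may differ).
import math
from collections import Counter

def tfidf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    n = len(docs)
    # document frequency of each term in the vocabulary T
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # w_kj = tf(t, d) * log(N / df(t))
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = [["oil", "price", "oil"], ["wheat", "price"], ["oil", "wheat"]]
for vec in tfidf_vectors(docs):
    print(vec)
```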

Proposed feature extraction method

In this paper we propose a new method for feature selection that, based on the probabilistic topic model, finds, among all the |Tp| possible pairs, those that are the most discriminative. The feature extraction module is represented in Fig. 3. The input of the system is the set of documents Ωr = (d1, …, d|Ωr|) and the output is a vector of weighted word pairs g = {b1, …, b|Tsp|}, where Tsp is the number of pairs and bn is the weight associated with each pair (feature) tn = (vi, vj). Therefore, the method works on the
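As a rough illustration of how such weighted pairs might be derived from a topic model, the sketch below uses gensim's LDA and scores each pair by the joint probability of its two words under the topics, assuming a uniform topic prior; this is our simplification, not the paper's exact formulation:

```python
# Rough sketch: scoring word pairs with an LDA topic model (gensim).
# The pair weight used here, b_ij = sum_z P(v_i|z) P(v_j|z) P(z), is a
# simplified stand-in for the paper's probabilistic-topic-model weights.
from itertools import combinations
import numpy as np
from gensim import corpora, models

texts = [doc.split() for doc in [
    "crude oil prices rose", "opec cut crude oil output",
    "wheat exports fell", "corn and wheat shipments resumed",
]]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)
phi = lda.get_topics()                               # P(word | topic), shape (Z, |T|)
p_z = np.full(lda.num_topics, 1.0 / lda.num_topics)  # assume a uniform P(z)

pairs = {}
for i, j in combinations(range(len(dictionary)), 2):
    pairs[(dictionary[i], dictionary[j])] = float((phi[:, i] * phi[:, j] * p_z).sum())

# keep the most strongly weighted pairs as candidate features
for pair, weight in sorted(pairs.items(), key=lambda kv: -kv[1])[:5]:
    print(pair, round(weight, 4))
```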

Graph building

A graph g is learned from a corpus of documents as the result of two important phases: the Relations Learning stage, where the graph's relation weights are learned by computing probabilities between word pairs (see Fig. 3), and the Structure Learning stage, which specifies the shape, namely the structure, of the graph. This stage is carried out through an iterative procedure which, given the number of keywords H and the desired maximum number of pairs as constraints, chooses the best settings of the parameters τ and μ
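Since the snippet is truncated, the following sketch only illustrates one plausible form of the Structure Learning step: given the pair weights from the previous sketch, it keeps a bounded number of pairs above a threshold τ, restricted to the H strongest keywords (the paper's second parameter μ is not modeled here):

```python
# Illustrative sketch of a structure-learning step. The parameter names
# (H, tau, max_pairs) mirror the paper's constraints; the selection rule
# itself is our guess at the flavor of the iterative procedure.
def build_graph(pair_weights: dict[tuple[str, str], float],
                H: int, tau: float, max_pairs: int) -> dict[tuple[str, str], float]:
    # rank single words by the total weight of the pairs they take part in
    strength: dict[str, float] = {}
    for (vi, vj), w in pair_weights.items():
        strength[vi] = strength.get(vi, 0.0) + w
        strength[vj] = strength.get(vj, 0.0) + w
    keywords = set(sorted(strength, key=strength.get, reverse=True)[:H])

    # keep pairs between keywords, above tau, up to the max_pairs budget
    kept = [(p, w) for p, w in pair_weights.items()
            if w >= tau and p[0] in keywords and p[1] in keywords]
    kept.sort(key=lambda pw: -pw[1])
    return dict(kept[:max_pairs])

# `pairs` is the pair-weight dictionary from the previous sketch
graph = build_graph(pairs, H=4, tau=0.001, max_pairs=6)
print(graph)
```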

Evaluation

We have considered a classic text classification problem performed on the Reuters-21578 repository. This is a collection of 21,578 newswire articles, originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. The articles are assigned classes from a set of 118 topic categories. A document may be assigned several classes or none, but the most common case is a single assignment (documents with at least one class received an average of 1.24 classes).

For this task we have used the
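For reproduction purposes, the ModApte split used here is conveniently available through NLTK (the paper does not specify tooling; the top-10 class list below is the standard one):

```python
# Sketch of loading the Reuters-21578 ModApte split with NLTK (one
# convenient source of the corpus, not necessarily the authors' choice).
import nltk
nltk.download("reuters")  # fetch the corpus once
from nltk.corpus import reuters

top10 = ["earn", "acq", "money-fx", "grain", "crude",
         "trade", "interest", "ship", "wheat", "corn"]

train_ids = [f for f in reuters.fileids() if f.startswith("training/")]
test_ids = [f for f in reuters.fileids() if f.startswith("test/")]
print(len(train_ids), len(test_ids))  # ModApte: 7769 train / 3019 test

# e.g. the positive training examples of one of the top-10 classes
crude_train = [f for f in train_ids if "crude" in reuters.categories(f)]
print(len(crude_train))
```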

Discussion

If we apply the bag-of-words representation to the reduced training set ϒr, we obtain a number of features |T| that is higher than the number of documents: |T| ≫ |ϒr|. In this case, even if we reduce the dimensionality by selecting the most discriminative features Ts, we may still have |Ts| ≫ |ϒr|, because the number of samples may be too small. As already discussed in Section 3.1, when |Ts| ≫ |ϒr| the accuracy of classifiers is poor.

Notwithstanding this, a way to improve the performance of

Conclusions and future work

The proposed structure implements a document-ranking text classifier that makes a soft decision: it produces a ranking of documents, which requires the choice of an appropriate threshold (Categorization Status Value) in order to obtain a binary classification. This threshold was chosen by evaluating performance on a validation set in terms of micro-precision, micro-recall and micro-F1. The Reuters-21578 dataset, consisting of about 21 thousand newswire articles, has been used; in
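A minimal sketch of the threshold selection described above, using scikit-learn's micro-averaged F1 on hypothetical validation scores (the data and the sweep granularity are placeholders):

```python
# Sketch of choosing the Categorization Status Value threshold on a
# validation set by maximizing micro-F1 (all values are hypothetical).
import numpy as np
from sklearn.metrics import f1_score

# scores[i, c] = ranking score of validation document i for class c;
# y_true[i, c] = 1 if the document belongs to class c
scores = np.array([[0.9, 0.2], [0.4, 0.7], [0.1, 0.8], [0.6, 0.3]])
y_true = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])

best_tau, best_f1 = 0.0, -1.0
for tau in np.linspace(0.0, 1.0, 101):
    y_pred = (scores >= tau).astype(int)  # turn the ranking into a binary decision
    f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)
    if f1 > best_f1:
        best_tau, best_f1 = tau, f1

print(best_tau, best_f1)
```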

References (36)

  • A.L. Blum et al.

    Selection of relevant features and examples in machine learning

    Artificial Intelligence

    (1997)
  • Y. Ko et al.

    Text classification from unlabeled documents with bootstrapping and feature projection techniques

    Information Processing Management

    (2009)
  • A. Grzywaczewski et al.

    Task-specific information retrieval systems for software engineers

    Journal of Computer and System Sciences

    (2012)
  • C. Aggarwal et al.

    A survey of text classification algorithms

  • P. Berkhin

    A survey of clustering data mining techniques

  • C.M. Bishop

    Neural networks for pattern recognition

    (1995)
  • C.M. Bishop

    Pattern recognition and machine learning

    (2006)
  • D.M. Blei et al.

    Latent Dirichlet allocation

    Journal of Machine Learning Research

    (2003)
  • C.D. Manning et al.

    Introduction to information retrieval

    (2008)
  • F. Clarizia et al.

    A new text classification technique using small training sets

  • F. Clarizia et al.

    Mixed graph of terms for query expansion

  • F. Colace et al.

    Improving text retrieval accuracy by using a minimal relevance feedback

  • C. Cortes et al.

    Support-vector networks

    Machine Learning

    (1995)
  • Fodor, I. (2002). A survey of dimension reduction techniques, technical...
  • T.L. Griffiths et al.

    Topics in semantic representation

    Psychological Review

    (2007)
  • H. Dong et al.

    A framework for discovering and classifying ubiquitous services in digital health ecosystems

    Journal of Computer and System Sciences

    (2011)
  • T. Hastie et al.

    The elements of statistical learning

    (2009)
  • I. Karavasilis et al.

    A model for investigating e-governance adoption using TAM and DOI

    International Journal of Knowledge Society Research

    (2010)
1 The authors contributed equally to this work.
