Text classification using a few labeled examples
Introduction
With the proliferation of blogs, social networks and e-commerce sites, there is great interest in supervised and semi-supervised text classification methods: to reveal user sentiments and opinions (Eirinaki & Pisal, 2012; Palus et al., 2011), to discover and classify health service information obtained from digital health ecosystems (Dong & Hussain, 2011; Karavasilis et al., 2010), and to classify web resources in order to improve the quality of web searches (Iqbal, 2012; Grzywaczewski, 2012; Liu, 2006; Colace et al., 2013).
The problem of supervised text classification has been extensively discussed in the literature, and performance metrics confirm that the existing techniques achieve high accuracy when trained on large labeled datasets (Christopher et al., 2008; Sebastiani, 2002; Lewis et al., 2004).
However, a supervised classifier is often infeasible in a real context, because a large labeled training set is not always available. It has been estimated that a human annotator needs about 90 min to label 100 documents, which makes the labeling task practically infeasible for large datasets (McCallum et al., 1999; Ko & Seo, 2009).
Furthermore, the accuracy of classifiers learned from a reduced training set (for instance, hundreds instead of thousands of labeled examples) is quite low, around 30% (Ko & Seo, 2009). This low accuracy stems from the fact that most existing methods use a vector of features composed of weighted words obtained under the “bag of words” assumption (Christopher et al., 2008). Due to the inherent ambiguity of language (polysemy, etc.), vectors of weighted words are insufficiently discriminative, especially when the classifier must learn common patterns from a few labeled examples described by numerous features (Clarizia et al., 2011; Napoletano et al., 2012).
In this paper we show that a more complex vector of features, based on weighted pairs of words, overcomes the limitations of simpler structures when the number of labeled samples is small. Specifically, we propose a linear, single-label supervised classifier that, using a vector of features composed of weighted pairs of words, achieves better accuracy than existing methods when the training set is about 1% of the original size and is composed of only positive examples. The proposed vector of features is automatically extracted from a set of documents using a global method for term extraction based on Latent Dirichlet Allocation (Blei, Ng, & Jordan, 2003), implemented as the Probabilistic Topic Model (Griffiths, Steyvers, & Tenenbaum, 2007).
To confirm the discriminative power of the proposed features, we have compared their performance with that of several methods based on vectors of weighted words. The results, obtained on the top 10 classes of the ModApte split of the Reuters-21578 dataset, show that our method achieves better performance independently of the topic.
Background
Following the definition introduced in Sebastiani (2002), supervised text classification may be formalized as the task of approximating the unknown target function Φ̆ : D × C → {T, F} (namely, the expert) by means of a function Φ : D × C → {T, F} called the classifier, where C = {c1, …, c|C|} is a predefined set of categories and D is a set of documents. If Φ(dj, ci) = T, then dj is called a positive example (or a member) of ci, while if Φ(dj, ci) = F it is called a negative example of ci.
The categories are just symbolic
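The target-function formalism above can be sketched in code. The keyword-matching rule and the category names below are purely illustrative stand-ins, not the classifier proposed in the paper:

```python
# Minimal sketch of the formalism: a classifier Phi maps a (document,
# category) pair to T (positive example) or F (negative example).
# The keyword rule below is a toy stand-in for a learned classifier.
def make_classifier(keywords_by_category):
    """Build a toy Phi: a document is a positive example of a category
    if it contains at least one of that category's keywords."""
    def phi(document, category):
        words = set(document.lower().split())
        return bool(words & keywords_by_category[category])
    return phi

phi = make_classifier({"grain": {"wheat", "corn"}, "trade": {"tariff", "export"}})
print(phi("Wheat prices rose sharply", "grain"))  # positive example of 'grain'
print(phi("Wheat prices rose sharply", "trade"))  # negative example of 'trade'
```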
Document representation and dimension reduction
Texts cannot be directly interpreted by a classifier; therefore, an indexing procedure that maps a text dj into a compact representation of its content must be uniformly applied to the training and test documents. For the sake of simplicity we describe the procedure for the training set; it is repeated unchanged for the test set.
Each document dj can be represented, following the Vector Space Model (Christopher et al., 2008), as a vector of term weights dj = (w1j, …, w|T|j), where T is the set of terms (features) occurring at least once in the training set and wkj is the weight of term tk in document dj.
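As a minimal sketch of this indexing step, the following computes one common choice of term weight, tf-idf, over a toy corpus (the exact weighting scheme used in the paper may differ):

```python
import math
from collections import Counter

# Toy Vector Space Model indexing: each document d_j becomes a vector of
# tf-idf weights w_kj over the vocabulary T extracted from the training set.
docs = ["wheat corn wheat", "oil prices oil", "wheat oil trade"]

vocab = sorted({w for d in docs for w in d.split()})
df = Counter(w for d in docs for w in set(d.split()))  # document frequency
N = len(docs)

def tfidf_vector(doc):
    """Return the weight vector (w_1j, ..., w_|T|j) for one document."""
    tf = Counter(doc.split())
    return [tf[w] * math.log(N / df[w]) if w in tf else 0.0 for w in vocab]

vectors = [tfidf_vector(d) for d in docs]
```

Note that a term occurring in every document receives idf = log(1) = 0, i.e. it carries no discriminative weight.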
Proposed feature extraction method
In this paper we propose a new method for feature selection that, based on the probabilistic topic model, finds the pairs of words that are the most discriminative among all possible pairs. The feature extraction module is represented in Fig. 3. The input of the system is the set of documents and the output is a vector of weighted word pairs of length N, where N is the number of pairs and bn is the weight associated with each pair (feature) tn = (vi, vj). Therefore, the method works on the
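A hedged sketch of the pair-based representation follows. Here the weight bn of each pair tn = (vi, vj) is approximated by a simple normalized co-occurrence count; the paper instead derives these weights from the probabilistic topic model:

```python
from collections import Counter
from itertools import combinations

# Illustrative sketch only: features t_n = (v_i, v_j) are unordered word
# pairs, and the weight b_n is approximated by a normalized co-occurrence
# count. The paper derives the weights from a probabilistic topic model.
def pair_features(docs, top_n=5):
    counts = Counter()
    for d in docs:
        words = sorted(set(d.lower().split()))
        counts.update(combinations(words, 2))  # all unordered pairs per doc
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.most_common(top_n)}

features = pair_features(["crude oil prices", "oil prices fell", "crude oil exports"])
```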
Graph building
A graph g is learned from a corpus of documents as the result of two phases: the Relations Learning stage, where graph relation weights are learned by computing probabilities between word pairs (see Fig. 3), and the Structure Learning stage, which specifies the shape, namely the structure, of the graph. The latter stage is achieved through an iterative procedure which, given the number of keywords H and the desired maximum number of pairs as constraints, chooses the best parameter settings τ and μ
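The two phases can be sketched as follows. This is an illustrative simplification: relation weights are estimated here by relative co-occurrence, and the structure is pruned by a single threshold `tau` plus a cap on the number of pairs, whereas the exact semantics of τ, μ and H in the paper differ:

```python
from collections import Counter
from itertools import combinations

# Hedged sketch of the two phases: (1) Relations Learning estimates a
# weight for each word pair from co-occurrence; (2) Structure Learning
# keeps only edges with weight >= tau, capped at max_pairs edges.
def learn_graph(docs, tau=0.3, max_pairs=10):
    pair_counts = Counter()
    for d in docs:
        words = sorted(set(d.lower().split()))
        pair_counts.update(combinations(words, 2))
    # Relations Learning: P(v_i, v_j) estimated by relative co-occurrence.
    weights = {p: c / len(docs) for p, c in pair_counts.items()}
    # Structure Learning: prune by tau, keep the heaviest max_pairs edges.
    kept = sorted((p for p, w in weights.items() if w >= tau),
                  key=lambda p: -weights[p])[:max_pairs]
    return {p: weights[p] for p in kept}

g = learn_graph(["crude oil prices rise", "oil prices fall", "grain exports rise"])
```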
Evaluation
We have considered a classic text classification problem performed on the Reuters-21578 repository, a collection of 21,578 newswire articles originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. The articles are assigned classes from a set of 118 topic categories. A document may be assigned several classes or none, but the most common case is a single assignment (documents with at least one class received an average of 1.24 classes).
For this task we have used the
Discussion
If we apply the bag-of-words representation to the reduced training set ϒr, we obtain a number of features that is higher than the number of documents. In this case, even if we reduce the dimensionality by selecting only the most discriminative features, the feature set may still be larger than the sample set, because the number of samples is too small. As already discussed in Section 3.1, when this happens the accuracy of classifiers is poor.
Notwithstanding this, a way to improve the performance of
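The dimensionality problem discussed above can be made concrete with a toy example: even on a tiny "reduced training set", the bag-of-words vocabulary exceeds the number of documents, and frequency-based selection (used here purely for illustration) must be very aggressive to change that:

```python
from collections import Counter

# Sketch of the dimensionality problem on a tiny reduced training set:
# the bag-of-words vocabulary T is larger than the number of documents.
docs = ["wheat exports rise sharply", "oil prices fall again"]
vocab = {w for d in docs for w in d.split()}

# Even selecting the k most frequent terms only shrinks T; with few
# samples, the reduced set T_s can still exceed the document count.
k = 3
freq = Counter(w for d in docs for w in d.split())
selected = [w for w, _ in freq.most_common(k)]  # reduced feature set T_s
```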
Conclusions and future work
The proposed structure implements a document-ranking text classifier, which makes a soft decision: it draws up a ranking of documents, and a binary classification then requires the choice of an appropriate threshold (the Categorization Status Value). This threshold was chosen by evaluating performance on a validation set in terms of micro-precision, micro-recall and micro-F1. The Reuters-21578 dataset, consisting of about 21 thousand newswire articles, has been used; in
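The threshold-selection step can be sketched as follows. For brevity this toy version maximizes F1 for a single category on a validation set, whereas the paper uses micro-averaged measures across categories:

```python
# Hedged sketch of choosing the Categorization Status Value threshold:
# the ranking classifier assigns each validation document a score, and
# the threshold that maximizes F1 turns the ranking into a binary decision.
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def best_threshold(scores, labels):
    best, best_f1 = 0.0, -1.0
    for t in sorted(set(scores)):          # candidate thresholds
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        if f1(tp, fp, fn) > best_f1:
            best, best_f1 = t, f1(tp, fp, fn)
    return best

# Illustrative validation scores and true labels for one category.
thr = best_threshold([0.9, 0.8, 0.4, 0.2], [True, True, False, False])
```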
References (36)
- Selection of relevant features and examples in machine learning. Artificial Intelligence (1997).
- Text classification from unlabeled documents with bootstrapping and feature projection techniques. Information Processing & Management (2009).
- Task-specific information retrieval systems for software engineers. Journal of Computer and System Sciences (2012).
- A survey of text classification algorithms.
- A survey of clustering data mining techniques.
- Neural networks for pattern recognition (1995).
- Pattern recognition and machine learning (2006).
- Latent Dirichlet allocation. Journal of Machine Learning Research (2003).
- Introduction to information retrieval (2008).
- A new text classification technique using small training sets.
- Mixed graph of terms for query expansion.
- Improving text retrieval accuracy by using a minimal relevance feedback.
- Support-vector networks. Machine Learning.
- Topics in semantic representation. Psychological Review.
- A framework for discovering and classifying ubiquitous services in digital health ecosystems. Journal of Computer and System Sciences.
- The elements of statistical learning.
- A model for investigating e-governance adoption using TAM and DOI. International Journal of Knowledge Society Research.
¹ The authors contributed equally to this work.