Text classification from unlabeled documents with bootstrapping and feature projection techniques
Introduction
With the rapid growth of the World Wide Web, classifying natural language documents into a pre-defined set of semantic categories has become one of the key methods for organizing online information; this task is commonly referred to as text classification. Electronic texts now pour in not only from the Web but also from many other online sources (electronic mail, corporate databases, chat rooms, digital libraries, and so on), and classifying them into topical categories is one way of organizing this overwhelming amount of data.
Since the machine learning paradigm emerged in the 1990s, many machine learning algorithms have been applied to text classification by supervised learning. A supervised learning algorithm finds a representation or decision rule from a set of labeled example documents for each class. A wide range of supervised learning algorithms has been applied to this area using a training set of labeled documents, including Naive Bayes (Ko & Seo, 2000; McCallum & Nigam, 1998), Rocchio (Lewis, Schapire, Callan, & Papka, 1996), k-Nearest Neighbor (k-NN) (Yang, Slattery, & Ghani, 2002), and Support Vector Machines (SVM) (Joachims, 2001).
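As an illustration of the supervised paradigm, the following is a minimal sketch of a multinomial Naive Bayes text classifier with Laplace smoothing. The toy documents, labels, and function names are purely illustrative assumptions, not data or code from the paper's experiments.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes model from (tokens, label) pairs."""
    class_docs = defaultdict(int)       # documents per class (for priors)
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def classify_nb(model, tokens):
    """Pick the class maximizing log P(c) + sum log P(t|c)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best, best_lp = None, float("-inf")
    for c, n in class_docs.items():
        lp = math.log(n / total_docs)  # class prior
        denom = sum(word_counts[c].values()) + len(vocab)
        for t in tokens:
            # Laplace (add-one) smoothing avoids zero probabilities
            lp += math.log((word_counts[c][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Toy training set (labels and words are illustrative).
train = [
    (["car", "engine", "gear"], "autos"),
    (["sedan", "car", "driver"], "autos"),
    (["game", "score", "team"], "sports"),
    (["team", "player", "game"], "sports"),
]
model = train_nb(train)
print(classify_nb(model, ["car", "driver"]))  # autos
```

With only a handful of labeled documents per class, such a classifier is easy to train, which is precisely why the labeling cost discussed next becomes the bottleneck.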
However, the major bottleneck of supervised learning algorithms is that they require a large number of labeled training documents for accurate learning. Since the labeling task must be done manually, it is a painfully time-consuming process. Furthermore, since the application areas of automatic text classification have diversified from newswire articles and web pages to e-mails and newsgroup postings, creating training data for each application area is also difficult (Nigam, McCallum, Thrun, & Mitchell, 1998). McCallum, Nigam, Rennie, and Seymore (1999) found that only 100 documents could be hand-labeled in 90 minutes, and a classifier learned from this small training set achieved just 30% accuracy in their experiments. Most users of a practical system, however, do not want to spend a long time on labeling only to obtain this level of accuracy. They obviously prefer algorithms that achieve high accuracy without requiring a large amount of manual labeling.
In this paper, we propose a new text classification method based on unsupervised or semi-supervised learning. The proposed method uses only unlabeled documents and the title word of each category as initial data for learning text classification. While labeled data is difficult to obtain, unlabeled data is readily available and plentiful. Therefore, this paper advocates an automatic labeling task using a bootstrapping technique and a robust text classifier using a feature projection technique. The input to the bootstrapping process is a large amount of unlabeled documents and a small amount of seed information to tell the learner about the specific task; here, we consider a title word associated with each category as the seed information. To automatically build a text classifier from unlabeled documents, we must solve two problems: how to automatically generate labeled training documents (machine-labeled data) from only a title word, and how to handle incorrectly labeled documents in the machine-labeled data. This paper provides solutions to both problems. For the former, we employ the bootstrapping technique; for the latter, we use the TCFP (Text Categorization using Feature Projections) classifier, which is robust to noisy data (Ko & Seo, 2002).
Do you think that it is possible to build a text classifier with only unlabeled documents? At first glance it seems not, because unlabeled documents lack the most important piece of information: their category. In general, existing supervised learning algorithms cannot construct any decision rule without labeled data, so labeled training data must be obtained before they can be used. Here, we explain how labeled data can be generated from unlabeled data for text classification. Since text classification is based on pre-defined categories, developers at least know the categories into which documents are to be classified. Knowing the categories means that they can at least choose a title word for each category. This is the starting point of the proposed method. By carrying out a bootstrapping task starting from the title words, developers can finally obtain labeled training data.
Suppose that we are going to classify documents into an ‘Autos’ category. First, the title word ‘automobile’ is selected for this category, and then related keywords of ‘Autos’ (e.g. ‘car’, ‘gear’, ‘transmission’, ‘sedan’) are extracted using co-occurrence information between the title word (‘automobile’) and the other words. In the proposed method, a context is defined as the unit of meaning for the bootstrapping process from the title word; its size lies between a sentence and a document (a sequence of 60 words in a document). The bootstrapping process first extracts the most informative contexts for the category, namely those that include the title word or at least one of the keywords. The extracted contexts are called centroid-contexts because they are regarded as contexts carrying the core meaning of the category. From the centroid-contexts, we can obtain many words that directly co-occur with the title word and the keywords (e.g. ‘driver’, ‘clutch’, ‘trunk’, and so on); these words are in first-order co-occurrence with the title word and the keywords. Since words in first-order co-occurrence alone cannot sufficiently describe the meaning of the category, we collect more contexts by measuring similarities between the centroid-contexts and the remaining contexts, i.e. those containing neither the title word nor any keyword. The collected contexts contain words in second-order co-occurrence with the title word and the keywords. As a result, the context-cluster of the category is constructed as the combination of the centroid-contexts and the contexts collected by the similarity measure. A Naive Bayes classifier can then learn from the created context-clusters. Since the Naive Bayes classifier can assign each unlabeled document a label, labeled training documents are obtained automatically; we call them machine-labeled data.
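The centroid-context extraction and similarity-based collection described above can be sketched as follows. The seed words, toy contexts, similarity threshold, and helper names are illustrative assumptions; real contexts would be 60-word windows rather than short token lists, and the paper's actual similarity measure may differ.

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def build_context_cluster(contexts, seed_words, threshold=0.3):
    """Take contexts containing a seed word as centroid-contexts, then
    collect remaining contexts similar to their combined centroid
    (capturing second-order co-occurrence)."""
    centroid_ctxs = [c for c in contexts if seed_words & set(c)]
    remaining = [c for c in contexts if not (seed_words & set(c))]
    centroid = Counter()
    for c in centroid_ctxs:
        centroid.update(c)
    collected = [c for c in remaining
                 if cosine(Counter(c), centroid) >= threshold]
    return centroid_ctxs + collected

seeds = {"automobile", "car", "gear"}       # title word + extracted keywords
contexts = [
    ["automobile", "engine", "driver"],     # centroid-context (has a seed)
    ["driver", "clutch", "trunk"],          # second-order: similar to centroid
    ["election", "senate", "vote"],         # unrelated context
]
cluster = build_context_cluster(contexts, seeds)
print(len(cluster))  # 2
```

The resulting context-cluster (centroid-contexts plus similar contexts) is what the Naive Bayes classifier learns from in place of hand-labeled documents.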
When the machine-labeled data is used to build supervised text classifiers, an additional problem arises: the data contains more incorrectly labeled documents than manually labeled data does. Thus we develop and employ the TCFP classifier, which is robust to noisy data, for learning from the machine-labeled data.
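The intuition behind voting on feature projections can be illustrated with a simplified sketch: each feature accumulates votes for categories from the training documents projected onto it, and a test document is classified by summing the normalized votes of its features. This is not the exact TCFP algorithm (which uses weighted votes), only a sketch of why voting over many features dampens the effect of a few mislabeled documents; all data and names are illustrative.

```python
from collections import Counter, defaultdict

def train_votes(docs):
    """Accumulate, per feature, votes for each category from the
    training documents projected onto that feature."""
    votes = defaultdict(Counter)
    for tokens, label in docs:
        for t in set(tokens):
            votes[t][label] += 1
    return votes

def classify_by_projection(votes, tokens):
    """Sum each feature's normalized per-category votes; the majority
    over many features outweighs a few noisy labels."""
    score = Counter()
    for t in set(tokens):
        total = sum(votes[t].values())
        if total:
            for c, v in votes[t].items():
                score[c] += v / total  # normalized vote of feature t
    return score.most_common(1)[0][0]

# Machine-labeled toy data with one incorrectly labeled document.
train = [
    (["car", "engine", "gear"], "autos"),
    (["car", "sedan", "driver"], "autos"),
    (["game", "team", "score"], "sports"),
    (["car", "engine"], "sports"),  # noisy label
]
v = train_votes(train)
print(classify_by_projection(v, ["car", "engine", "driver"]))  # autos
```

Even with the mislabeled document, the summed votes of ‘car’, ‘engine’, and ‘driver’ still favor the correct category, which is the robustness property the paper exploits.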
The rest of this paper is organized as follows. Section 2 presents previous related work. In Section 3, we explain the bootstrapping technique to create machine-labeled data. Section 4 describes the TCFP classifier to learn from the machine-labeled data. Section 5 is devoted to the analysis of empirical results. In Section 6, we discuss the proposed method and results. Finally, we describe conclusions and future work.
Section snippets
Related work
In the literature, various studies aim to reduce the effort of labeling tasks. Some are based on models that learn from both labeled and unlabeled documents (Ghani, 2002; Lanquillon, 2000; Nigam, 2001), models that perform partially supervised classification (Jeon & Landgrebe, 1999; Liu et al., 2002), or active learning (Roy & McCallum, 2001; Tong & Koller, 2001). An alternative strategy is to employ unsupervised clustering for text classification (Adami et al., 2003,
The bootstrapping technique to generate machine-labeled data
The bootstrapping process consists of three modules, as shown in Fig. 1: a module to preprocess unlabeled documents, a module to construct context-clusters for training, and a module to build the Naive Bayes classifier using the context-clusters. Each module is described in detail in the following sections.
Using a feature projection technique for handling the noisy data of the machine-labeled data
Document-level labeled data, the machine-labeled data, is finally obtained through the bootstrapping process. Now text classifiers can learn from the machine-labeled data. But since the machine-labeled data is created by the proposed bootstrapping method, it generally includes more incorrectly labeled documents than human-labeled data. In order to handle them effectively, a feature projection technique is applied to our text classifier (TCFP) (Ko & Seo, 2002). By the property of the
Data sets and experimental settings
To test the proposed method, we used three different kinds of data sets: UseNet newsgroups (20 Newsgroups), web pages (WebKB), and newswire articles (Reuters 21578). For fair evaluation on Newsgroups and WebKB, we employed five-fold cross-validation. That is, each data set is split into five subsets; each subset is used once as test data in a particular run, while the remaining subsets are used as training data for that run. The split into training and test sets for each run is
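The five-fold cross-validation protocol described above can be sketched as follows; the shuffling seed and helper name are illustrative assumptions.

```python
import random

def five_fold_splits(data, seed=0):
    """Yield (train, test) pairs for five-fold cross-validation:
    each fold serves exactly once as the test set."""
    items = list(data)
    random.Random(seed).shuffle(items)       # fixed seed for reproducibility
    folds = [items[i::5] for i in range(5)]  # five roughly equal folds
    for i in range(5):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Sanity check on 100 dummy documents: folds are disjoint and exhaustive.
data = list(range(100))
for train, test in five_fold_splits(data):
    assert len(train) == 80 and len(test) == 20
    assert not set(train) & set(test)
print("ok")  # ok
```

Reported scores are then averaged over the five runs, so every document contributes once to testing and four times to training.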
Discussion
We here discuss the weaknesses of the proposed method and propose a hybrid keyword extraction method to overcome them. We then observe how many human-labeled documents are required to match the performance of the proposed method on each data set.
Conclusions and future work
This paper has presented a new unsupervised or semi-supervised text classification method. Though the proposed method uses only title words and unlabeled data, its performance is reasonably comparable to that of the supervised Naive Bayes classifier. Moreover, it outperforms a clustering method, sIB. Labeled data is expensive, while unlabeled data is inexpensive and plentiful. Therefore, the proposed method is useful for low-cost text classification. Furthermore, if some text classification tasks
Acknowledgement
This paper was supported by Dong-A University Research Fund in 2008.
References (28)
- Adami, G., Avesani, P., & Sona, D. (2003). Bootstrapping for hierarchical document classification. In Proceedings of...
- Brill, E. (1995). Transformation-based error-driven learning and natural language processing: a case study in part of speech tagging. Computational Linguistics.
- Cho, K., & Kim, J. (1997). Automatic text categorization on hierarchical category structure by using ICF (inverse...
- Craven, M., et al. (2000). Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence.
- Ghani, R. (2002). Combining labeled and unlabeled data for multiclass text categorization. In Proceedings of...
- Jeon, B., & Landgrebe, D. (1999). Partially supervised classification using weighted unsupervised clustering. IEEE Transactions on Geoscience and Remote Sensing.
- Joachims, T. (2001). Learning to classify text using support vector machines.
- Karov, Y., & Edelman, S. (1998). Similarity-based word sense disambiguation. Computational Linguistics.
- Ko, Y., & Seo, J. (2000). Automatic text categorization by unsupervised learning. In Proceedings of the 18th...
- Ko, Y., & Seo, J. (2002). Text categorization using feature projections. In Proceedings of the 19th international...
- Maarek, Y. S., Berry, D. M., & Kaiser, G. E. (1991). An information retrieval approach for automatically constructing software libraries. IEEE Transactions on Software Engineering.