Text classification from unlabeled documents with bootstrapping and feature projection techniques

https://doi.org/10.1016/j.ipm.2008.07.004

Abstract

Many machine learning algorithms have been applied to text classification tasks. In the machine learning paradigm, a general inductive process automatically builds a text classifier by learning from examples, an approach generally known as supervised learning. However, supervised learning approaches have a notable problem: they require a large number of labeled training documents for accurate learning. While unlabeled documents are easily collected and plentiful, labeled documents are difficult to obtain because the labeling task must be done by human developers. In this paper, we propose a new text classification method based on unsupervised or semi-supervised learning. The proposed method starts the text classification task with only unlabeled documents and the title word of each category, and then automatically learns a text classifier using bootstrapping and feature projection techniques. Experimental results show that the proposed method achieves performance reasonably close to that of a supervised method. If the proposed method is used in a text classification task, building text classification systems will become significantly faster and less expensive.

Introduction

With the rapid growth of the World Wide Web, the task of classifying natural language documents into a pre-defined set of semantic categories has become one of the key methods for organizing online information. This task is commonly referred to as text classification. Since there has recently been an explosion of electronic text, not only from the World Wide Web but also from various online sources (electronic mail, corporate databases, chat rooms, digital libraries, and so on), one way of organizing this overwhelming amount of data is to classify it into topical categories.

Since the machine learning paradigm emerged in the 1990s, many machine learning algorithms have been applied to text classification by supervised learning. A supervised learning algorithm finds a representation or decision rule from an example set of labeled documents for each class. A wide range of supervised learning algorithms has been applied to this area using a training data set of labeled documents, including Naive Bayes (Ko and Seo, 2000, McCallum and Nigam, 1998), Rocchio (Lewis, Schapire, Callan, & Papka, 1996), k-Nearest Neighbor (k-NN) (Yang, Slattery, & Ghani, 2002), and Support Vector Machines (SVM) (Joachims, 2001).
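As a concrete illustration of the supervised setting (not any specific system from the paper), a minimal multinomial Naive Bayes text classifier with Laplace smoothing can be sketched as follows; the function names, toy documents, and category labels are invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count words per class from labeled documents."""
    vocab = set()
    word_counts = defaultdict(Counter)   # class -> word -> count
    class_counts = Counter(labels)       # class -> number of documents
    for doc, label in zip(docs, labels):
        words = doc.lower().split()
        vocab.update(words)
        word_counts[label].update(words)
    return vocab, word_counts, class_counts

def classify_nb(doc, vocab, word_counts, class_counts):
    """Return the class maximizing log P(c) + sum_w log P(w|c)."""
    total_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c in class_counts:
        score = math.log(class_counts[c] / total_docs)
        total_words = sum(word_counts[c].values())
        for w in doc.lower().split():
            # Laplace smoothing over the shared vocabulary
            score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best
```

For example, training on two toy documents labeled 'autos' and 'sports' and then classifying "engine car" would return 'autos', since those words were seen only in the 'autos' training document.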

However, the major bottleneck of supervised learning algorithms is that they require a large number of labeled training documents for accurate learning. Since the labeling task must be done manually, it is a painfully time-consuming process. Furthermore, since the application area of automatic text classification has diversified from newswire articles and web pages to e-mails and newsgroup postings, creating training data for each application area is itself a difficult task (Nigam, McCallum, Thrun, & Mitchell, 1998). McCallum, Nigam, Rennie, and Seymore (1999) found that only 100 documents could be hand-labeled in 90 minutes, and in their experiments a classifier learned from this small training set achieved just 30% accuracy. Most users of a practical system, however, do not want to spend a long time on labeling only to obtain this level of accuracy. They obviously prefer algorithms that achieve high accuracy without requiring a large amount of manual labeling.

In this paper, we propose a new text classification method based on unsupervised or semi-supervised learning. The proposed method uses only unlabeled documents and the title word of each category as initial data for learning text classification. While labeled data is difficult to obtain, unlabeled data is readily available and plentiful. Therefore, this paper advocates an automatic labeling task using a bootstrapping technique and a robust text classifier using a feature projection technique. The input to the bootstrapping process is a large amount of unlabeled documents and a small amount of seed information that tells the learner about the specific task. Here, we consider a title word associated with each category as the seed information. To automatically build a text classifier from unlabeled documents, we must solve two problems: how to automatically generate labeled training documents (machine-labeled data) from only a title word, and how to handle incorrectly labeled documents in the machine-labeled data. This paper provides solutions to both problems. For the former, we employ the bootstrapping technique; for the latter, we use the TCFP (Text Categorization using Feature Projections) classifier, which is robust to noisy data (Ko & Seo, 2002).

Is it possible to build a text classifier with only unlabeled documents? At first glance it seems that no information can be gained from unlabeled documents, because they lack the most important piece of information: their category. In general, existing supervised learning algorithms cannot construct any decision rules without labeled data, so labeled training data must be obtained before they can be used. Here, we explain how labeled data can be generated from unlabeled data for text classification. Since text classification is a task based on pre-defined categories, developers at least know the categories into which documents are to be classified. Knowing the categories means that they can at least choose a title word for each category. This is the starting point of the proposed method. By carrying out a bootstrapping task from the title word, developers can finally obtain labeled training data.

Suppose that we are going to classify documents into an ‘Autos’ category. First, ‘automobile’ is selected as the title word of this category, and then the related keywords (e.g. ‘car’, ‘gear’, ‘transmission’, ‘sedan’) of ‘Autos’ are extracted using co-occurrence information between the title word (‘automobile’) and the other words. In the proposed method, a context is defined as the unit of meaning for the bootstrapping process from the title word; its size lies between that of a sentence and a document (a sequence of 60 words within a document). The bootstrapping process first extracts the most informative contexts for the category, namely those that include the title word or at least one of the keywords. The extracted contexts are called centroid-contexts because they are regarded as carrying the core meaning of the category. From the centroid-contexts, we can obtain many words that directly co-occur with the title word and the keywords (e.g. ‘driver’, ‘clutch’, ‘trunk’, and so on); these words are in first-order co-occurrence with the title word and the keywords. Since first-order co-occurrence words alone cannot sufficiently describe the meaning of the category, we collect further contexts by measuring similarities between the centroid-contexts and the remaining contexts, i.e. those that contain neither the title word nor any keyword. The collected contexts contain words in second-order co-occurrence with the title word and the keywords. As a result, the context-cluster of the category is constructed as the combination of the centroid-contexts and the contexts collected by the similarity measure. A Naive Bayes classifier can then learn from the created context-clusters. Since the Naive Bayes classifier can assign a label to each unlabeled document, labeled training documents are obtained automatically; we call these machine-labeled data.
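The steps above can be sketched in simplified form as follows. The window size, similarity threshold, function names, and the plain cosine similarity are our own assumptions for illustration; the paper's actual keyword selection and similarity measures are more elaborate:

```python
import math
from collections import Counter

def split_contexts(document, size=60):
    """Split a document into fixed-size word windows (the paper uses 60 words)."""
    words = document.lower().split()
    return [words[i:i + size] for i in range(0, len(words), size)]

def find_keywords(contexts, title_word, top_n=4):
    """Rank words by raw co-occurrence with the title word within a context
    (a stand-in for the paper's co-occurrence-based keyword selection)."""
    co = Counter()
    for c in contexts:
        if title_word in c:
            co.update(w for w in c if w != title_word)
    return [w for w, _ in co.most_common(top_n)]

def cosine(a, b):
    """Cosine similarity between two bag-of-words contexts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_context_cluster(contexts, title_word, keywords, threshold=0.2):
    """Centroid-contexts contain a seed word (first-order co-occurrence);
    remaining contexts join the cluster by similarity (second-order)."""
    seeds = {title_word, *keywords}
    centroids = [c for c in contexts if seeds & set(c)]
    rest = [c for c in contexts if not (seeds & set(c))]
    expanded = [c for c in rest
                if centroids and max(cosine(c, cc) for cc in centroids) >= threshold]
    return centroids + expanded
```

A context mentioning only ‘gear’ and ‘clutch’ contains no seed word, yet it joins the ‘Autos’ cluster because it is similar to a centroid-context that mentions ‘gear’ alongside ‘automobile’; this is how second-order co-occurrence words enter the cluster.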

When the machine-labeled data is used to build supervised text classifiers, there is an additional problem: the data contains more incorrectly labeled documents than manually labeled data does. Thus we develop and employ the TCFP classifier, which is robust to noisy data, for learning from the machine-labeled data.

The rest of this paper is organized as follows. Section 2 presents related work. In Section 3, we explain the bootstrapping technique used to create machine-labeled data. Section 4 describes the TCFP classifier that learns from the machine-labeled data. Section 5 is devoted to the analysis of empirical results. In Section 6, we discuss the proposed method and the results. Finally, we present conclusions and future work.

Section snippets

Related work

In the literature, there are various studies that aim to reduce the effort required for labeling tasks. Some studies are based on models that learn from both labeled and unlabeled documents (Ghani, 2002, Lanquillon, 2000, Nigam, 2001), models that perform partially supervised classification (Jeon and Landgrebe, 1999, Liu et al., 2002), or active learning (Roy and McCallum, 2001, Tong and Koller, 2001). An alternative strategy is to employ unsupervised clustering for text classification (Adami et al., 2003,

The bootstrapping technique to generate machine-labeled data

The bootstrapping process consists of three modules, as shown in Fig. 1: a module to preprocess unlabeled documents, a module to construct context-clusters for training, and a module to build the Naive Bayes classifier using the context-clusters. Each module is described in detail in the following sections.

Using a feature projection technique for handling the noisy data of the machine-labeled data

Through the bootstrapping process, we finally obtain labeled data at the document level: the machine-labeled data. Text classifiers can now learn from this machine-labeled data. However, since the machine-labeled data is created by the proposed bootstrapping method, it generally includes more incorrectly labeled documents than human-labeled data does. In order to handle them effectively, a feature projection technique is applied in our text classifier (TCFP) (Ko & Seo, 2002). By the property of the
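The robustness argument can be made concrete with a loose sketch of feature-projection-style classification by per-feature voting. This is our own simplification for illustration, not the exact TCFP formulation (the real classifier uses weighted votes derived from projections onto each feature):

```python
from collections import Counter, defaultdict

def train_feature_votes(docs, labels):
    """For each feature (word), record how often it appears per category.
    Each feature will later vote for the category it is most associated with."""
    votes = defaultdict(Counter)  # word -> category -> document count
    for doc, label in zip(docs, labels):
        for w in set(doc.lower().split()):
            votes[w][label] += 1
    return votes

def classify_by_voting(doc, votes):
    """Each word in the document casts one vote for its strongest category;
    the category with the most votes wins. Because features vote
    independently, a few mislabeled training documents shift only a few
    individual votes, which illustrates why this style of classifier
    tolerates noisy (machine-labeled) data."""
    tally = Counter()
    for w in set(doc.lower().split()):
        if w in votes:
            tally[votes[w].most_common(1)[0][0]] += 1
    return tally.most_common(1)[0][0] if tally else None
```

Even if one ‘autos’ training document is mislabeled as ‘sports’, each automotive word still associates more strongly with ‘autos’ overall, so the majority vote for a test document about cars is unaffected.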

Data sets and experimental settings

To test the proposed method, we used three different kinds of data sets: UseNet newsgroups (20 Newsgroups), web pages (WebKB), and newswire articles (Reuters-21578). For fair evaluation on Newsgroups and WebKB, we employed five-fold cross-validation. That is, each data set is split into five subsets, and each subset is used once as test data in a particular run while the remaining subsets serve as training data for that run. The split into training and test sets for each run is
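The five-fold protocol described above can be sketched as follows; the interleaved split used here is one simple way to form the five subsets and is not necessarily the exact partition used in the paper:

```python
def five_fold_splits(items, k=5):
    """Yield (train, test) pairs: each of the k subsets serves once as the
    test set while the remaining subsets form the training set."""
    folds = [items[i::k] for i in range(k)]  # k interleaved subsets
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Over the five runs, every document is used exactly once as test data and four times as training data, so the reported figure is typically the average performance across the runs.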

Discussion

Here we discuss the weaknesses of the proposed method and propose a hybrid keyword extraction method to overcome them. We then examine how many human-labeled documents are required to reach the performance of the proposed method on each data set.

Conclusions and future work

This paper has presented a new unsupervised or semi-supervised text classification method. Although the proposed method uses only title words and unlabeled data, it shows performance reasonably comparable to the supervised Naive Bayes classifier. Moreover, it outperforms a clustering method, sIB. Labeled data is expensive, while unlabeled data is inexpensive and plentiful. Therefore, the proposed method is useful for low-cost text classification. Furthermore, if some text classification tasks

Acknowledgement

This paper was supported by Dong-A University Research Fund in 2008.

References (28)

  • Craven, M., et al. (2000). Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence.
  • Adami, G., Avesani, P., & Sona, D. (2003). Bootstrapping for hierarchical document classification. In Proceedings of...
  • Brill, E. (1995). Transformation-based error-driven learning and natural language processing: a case study in part of speech tagging. Computational Linguistics.
  • Cho, K., & Kim, J. (1997). Automatic text categorization on hierarchical category structure by using ICF (inverse...
  • Ghani, R. (2002). Combining labeled and unlabeled data for multiclass text categorization. In Proceedings of...
  • Jeon, B., et al. (1999). Partially supervised classification using weighted unsupervised clustering. IEEE Transactions on Geoscience and Remote Sensing.
  • Joachims, T. (2001). Learning to classify text using support vector machines.
  • Karov, Y., et al. (1998). Similarity-based word sense disambiguation. Computational Linguistics.
  • Ko, Y., & Seo, J. (2000). Automatic text categorization by unsupervised learning. In Proceedings of the 18th...
  • Ko, Y., & Seo, J. (2002). Text categorization using feature projections. In Proceedings of the 19th international...
  • Lanquillon, C. (2000). Partially supervised text categorization: combining labeled and unlabeled documents using an...
  • Lewis, D. D., Schapire, R. E., Callan, J. P., & Papka, R. (1996). Training algorithms for linear text classifiers. In...
  • Liu, B., Lee, W., Yu, P., & Li, X. (2002). Partially supervised classification of text documents. In Proceedings of...
  • Maarek, Y., et al. (1991). An information retrieval approach for automatically constructing software libraries. IEEE Transactions on Software Engineering.