Text classification from unlabeled documents with bootstrapping and feature projection techniques
Introduction
With the rapid growth of the World Wide Web, classifying natural language documents into a pre-defined set of semantic categories has become one of the key methods for organizing online information; this task is commonly referred to as text classification. Electronic texts now pour in not only from the Web but also from many other online sources (electronic mail, corporate databases, chat rooms, digital libraries, and so on), and classifying them into topical categories is one way of organizing this overwhelming amount of data.
Since the machine learning paradigm emerged in the 1990s, many machine learning algorithms have been applied to text classification by supervised learning. A supervised learning algorithm finds a representation or decision rule from a set of labeled example documents for each class. A wide range of supervised learning algorithms has been applied to this area using a training set of labeled documents, including Naive Bayes (Ko & Seo, 2000; McCallum & Nigam, 1998), Rocchio (Lewis, Schapire, Callan, & Papka, 1996), k-Nearest Neighbor (k-NN) (Yang, Slattery, & Ghani, 2002), and Support Vector Machines (SVM) (Joachims, 2001).
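As an illustration of the supervised paradigm, the following is a minimal sketch of a multinomial Naive Bayes text classifier with Laplace smoothing. The toy documents, labels, and function names are purely illustrative assumptions, not data or code from the paper's experiments.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes model from (tokens, label) pairs."""
    class_docs = defaultdict(int)       # documents per class (for priors)
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def classify_nb(model, tokens):
    """Pick the class maximizing log P(c) + sum log P(t|c)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best, best_lp = None, float("-inf")
    for c, n in class_docs.items():
        lp = math.log(n / total_docs)  # class prior
        denom = sum(word_counts[c].values()) + len(vocab)
        for t in tokens:
            # Laplace (add-one) smoothing avoids zero probabilities
            lp += math.log((word_counts[c][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Toy training set (labels and words are illustrative).
train = [
    (["car", "engine", "gear"], "autos"),
    (["sedan", "car", "driver"], "autos"),
    (["game", "score", "team"], "sports"),
    (["team", "player", "game"], "sports"),
]
model = train_nb(train)
print(classify_nb(model, ["car", "driver"]))  # autos
```

With only a handful of labeled documents per class, such a classifier is easy to train, which is precisely why the labeling cost discussed next becomes the bottleneck.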
However, the major bottleneck of supervised learning algorithms is that they require a large number of labeled training documents for accurate learning. Since the labeling task must be done manually, it is a painfully time-consuming process. Furthermore, since the application areas of automatic text classification have diversified from newswire articles and web pages to e-mails and newsgroup postings, creating training data for each application area is also difficult (Nigam, McCallum, Thrun, & Mitchell, 1998). McCallum, Nigam, Rennie, and Seymore (1999) found that only 100 documents could be hand-labeled in 90 minutes, and a classifier learned from this small training set achieved just 30% accuracy in their experiments. Most users of a practical system, however, do not want to spend a long time on labeling only to obtain this level of accuracy. They obviously prefer algorithms that achieve high accuracy without requiring a large amount of manual labeling.
In this paper, we propose a new text classification method based on unsupervised or semi-supervised learning. The proposed method uses only unlabeled documents and the title word of each category as initial data for learning text classification. While labeled data is difficult to obtain, unlabeled data is readily available and plentiful. Therefore, this paper advocates an automatic labeling task using a bootstrapping technique and a robust text classifier using a feature projection technique. The input to the bootstrapping process is a large amount of unlabeled documents and a small amount of seed information to tell the learner about the specific task; here, we consider a title word associated with each category as the seed information. To automatically build a text classifier from unlabeled documents, we must solve two problems: how to automatically generate labeled training documents (machine-labeled data) from only a title word, and how to handle incorrectly labeled documents in the machine-labeled data. This paper provides solutions to both problems. For the former, we employ the bootstrapping technique; for the latter, we use the TCFP (Text Categorization using Feature Projections) classifier, which is robust to noisy data (Ko & Seo, 2002).
Do you think that it is possible to build a text classifier with only unlabeled documents? At first glance it seems not, because unlabeled documents lack the most important piece of information: their category. In general, existing supervised learning algorithms cannot construct any decision rule without labeled data, so labeled training data must be obtained before they can be used. Here, we explain how labeled data can be generated from unlabeled data for text classification. Since text classification is based on pre-defined categories, developers at least know the categories into which documents are to be classified. Knowing the categories means that they can at least choose a title word for each category. This is the starting point of the proposed method. By carrying out a bootstrapping task starting from the title words, developers can finally obtain labeled training data.
Suppose that we are going to classify documents into an ‘Autos’ category. First, the title word ‘automobile’ is selected for this category, and then related keywords of ‘Autos’ (e.g. ‘car’, ‘gear’, ‘transmission’, ‘sedan’) are extracted using co-occurrence information between the title word (‘automobile’) and the other words. In the proposed method, a context is defined as the unit of meaning for the bootstrapping process from the title word; its size lies between a sentence and a document (a sequence of 60 words in a document). The bootstrapping process first extracts the most informative contexts for the category, namely those that include the title word or at least one of the keywords. The extracted contexts are called centroid-contexts because they are regarded as contexts carrying the core meaning of the category. From the centroid-contexts, we can obtain many words that directly co-occur with the title word and the keywords (e.g. ‘driver’, ‘clutch’, ‘trunk’, and so on); these words are in first-order co-occurrence with the title word and the keywords. Since words in first-order co-occurrence alone cannot sufficiently describe the meaning of the category, we collect more contexts by measuring similarities between the centroid-contexts and the remaining contexts, i.e. those containing neither the title word nor any keyword. The collected contexts contain words in second-order co-occurrence with the title word and the keywords. As a result, the context-cluster of the category is constructed as the combination of the centroid-contexts and the contexts collected by the similarity measure. A Naive Bayes classifier can then learn from the created context-clusters. Since the Naive Bayes classifier can assign each unlabeled document a label, labeled training documents are obtained automatically; we call them machine-labeled data.
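The centroid-context extraction and similarity-based collection described above can be sketched as follows. The seed words, toy contexts, similarity threshold, and helper names are illustrative assumptions; real contexts would be 60-word windows rather than short token lists, and the paper's actual similarity measure may differ.

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def build_context_cluster(contexts, seed_words, threshold=0.3):
    """Take contexts containing a seed word as centroid-contexts, then
    collect remaining contexts similar to their combined centroid
    (capturing second-order co-occurrence)."""
    centroid_ctxs = [c for c in contexts if seed_words & set(c)]
    remaining = [c for c in contexts if not (seed_words & set(c))]
    centroid = Counter()
    for c in centroid_ctxs:
        centroid.update(c)
    collected = [c for c in remaining
                 if cosine(Counter(c), centroid) >= threshold]
    return centroid_ctxs + collected

seeds = {"automobile", "car", "gear"}       # title word + extracted keywords
contexts = [
    ["automobile", "engine", "driver"],     # centroid-context (has a seed)
    ["driver", "clutch", "trunk"],          # second-order: similar to centroid
    ["election", "senate", "vote"],         # unrelated context
]
cluster = build_context_cluster(contexts, seeds)
print(len(cluster))  # 2
```

The resulting context-cluster (centroid-contexts plus similar contexts) is what the Naive Bayes classifier learns from in place of hand-labeled documents.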
When the machine-labeled data is used to build supervised text classifiers, an additional problem arises: the data contains more incorrectly labeled documents than manually labeled data does. Thus we develop and employ the TCFP classifier, which is robust to noisy data, for learning from the machine-labeled data.
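The intuition behind voting on feature projections can be illustrated with a simplified sketch: each feature accumulates votes for categories from the training documents projected onto it, and a test document is classified by summing the normalized votes of its features. This is not the exact TCFP algorithm (which uses weighted votes), only a sketch of why voting over many features dampens the effect of a few mislabeled documents; all data and names are illustrative.

```python
from collections import Counter, defaultdict

def train_votes(docs):
    """Accumulate, per feature, votes for each category from the
    training documents projected onto that feature."""
    votes = defaultdict(Counter)
    for tokens, label in docs:
        for t in set(tokens):
            votes[t][label] += 1
    return votes

def classify_by_projection(votes, tokens):
    """Sum each feature's normalized per-category votes; the majority
    over many features outweighs a few noisy labels."""
    score = Counter()
    for t in set(tokens):
        total = sum(votes[t].values())
        if total:
            for c, v in votes[t].items():
                score[c] += v / total  # normalized vote of feature t
    return score.most_common(1)[0][0]

# Machine-labeled toy data with one incorrectly labeled document.
train = [
    (["car", "engine", "gear"], "autos"),
    (["car", "sedan", "driver"], "autos"),
    (["game", "team", "score"], "sports"),
    (["car", "engine"], "sports"),  # noisy label
]
v = train_votes(train)
print(classify_by_projection(v, ["car", "engine", "driver"]))  # autos
```

Even with the mislabeled document, the summed votes of ‘car’, ‘engine’, and ‘driver’ still favor the correct category, which is the robustness property the paper exploits.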
The rest of this paper is organized as follows. Section 2 presents previous related work. In Section 3, we explain the bootstrapping technique to create machine-labeled data. Section 4 describes the TCFP classifier to learn from the machine-labeled data. Section 5 is devoted to the analysis of empirical results. In Section 6, we discuss the proposed method and results. Finally, we describe conclusions and future work.
Section snippets
Related work
In the literature, various studies aim to reduce the effort of labeling tasks. Some are based on models that learn from both labeled and unlabeled documents (Ghani, 2002; Lanquillon, 2000; Nigam, 2001), models that perform partially supervised classification (Jeon & Landgrebe, 1999; Liu et al., 2002), or active learning (Roy & McCallum, 2001; Tong & Koller, 2001). An alternative strategy is to employ unsupervised clustering for text classification (Adami et al., 2003,
The bootstrapping technique to generate machine-labeled data
The bootstrapping process consists of three modules, as shown in Fig. 1: a module to preprocess unlabeled documents, a module to construct context-clusters for training, and a module to build the Naive Bayes classifier using the context-clusters. Each module is described in detail in the following sections.
Using a feature projection technique for handling the noisy data of the machine-labeled data
Document-level labeled data, the machine-labeled data, is finally obtained through the bootstrapping process. Now text classifiers can learn from the machine-labeled data. But since the machine-labeled data is created by the proposed bootstrapping method, it generally includes more incorrectly labeled documents than human-labeled data. In order to handle them effectively, a feature projection technique is applied to our text classifier (TCFP) (Ko & Seo, 2002). By the property of the
Data sets and experimental settings
To test the proposed method, we used three different kinds of data sets: UseNet newsgroups (20 Newsgroups), web pages (WebKB), and newswire articles (Reuters 21578). For fair evaluation on Newsgroups and WebKB, we employed five-fold cross-validation. That is, each data set is split into five subsets; each subset is used once as test data in a particular run, while the remaining subsets are used as training data for that run. The split into training and test sets for each run is
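The five-fold cross-validation protocol described above can be sketched as follows; the shuffling seed and helper name are illustrative assumptions.

```python
import random

def five_fold_splits(data, seed=0):
    """Yield (train, test) pairs for five-fold cross-validation:
    each fold serves exactly once as the test set."""
    items = list(data)
    random.Random(seed).shuffle(items)       # fixed seed for reproducibility
    folds = [items[i::5] for i in range(5)]  # five roughly equal folds
    for i in range(5):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Sanity check on 100 dummy documents: folds are disjoint and exhaustive.
data = list(range(100))
for train, test in five_fold_splits(data):
    assert len(train) == 80 and len(test) == 20
    assert not set(train) & set(test)
print("ok")  # ok
```

Reported scores are then averaged over the five runs, so every document contributes once to testing and four times to training.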
Discussion
We here discuss the weaknesses of the proposed method and propose a hybrid keyword extraction method to overcome them. We then observe how many human-labeled documents are required to match the performance of the proposed method on each data set.
Conclusions and future work
This paper has presented a new unsupervised or semi-supervised text classification method. Though the proposed method uses only title words and unlabeled data, its performance is reasonably comparable to that of the supervised Naive Bayes classifier. Moreover, it outperforms a clustering method, sIB. Labeled data is expensive, while unlabeled data is inexpensive and plentiful. Therefore, the proposed method is useful for low-cost text classification. Furthermore, if some text classification tasks
Acknowledgement
This paper was supported by Dong-A University Research Fund in 2008.
References (28)
- Adami, G., Avesani, P., & Sona, D. (2003). Bootstrapping for hierarchical document classification. In Proceedings of...
- Brill, E. (1995). Transformation-based error-driven learning and natural language processing: a case study in part of speech tagging. Computational Linguistics.
- Cho, K., & Kim, J. (1997). Automatic text categorization on hierarchical category structure by using ICF (inverse...
- Craven, M., et al. (2000). Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence.
- Ghani, R. (2002). Combining labeled and unlabeled data for multiclass text categorization. In Proceedings of...
- Jeon, B., & Landgrebe, D. (1999). Partially supervised classification using weighted unsupervised clustering. IEEE Transactions on Geoscience and Remote Sensing.
- Joachims, T. (2001). Learning to classify text using support vector machines.
- Karov, Y., & Edelman, S. (1998). Similarity-based word sense disambiguation. Computational Linguistics.
- Ko, Y., & Seo, J. (2000). Automatic text categorization by unsupervised learning. In Proceedings of the 18th...
- Ko, Y., & Seo, J. (2002). Text categorization using feature projections. In Proceedings of the 19th international...
- Maarek, Y. S., Berry, D. M., & Kaiser, G. E. (1991). An information retrieval approach for automatically constructing software libraries. IEEE Transactions on Software Engineering.