Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec
Introduction
Document classification is one of the main tasks of text mining and has been used in several applications [6] such as spam filtering [3] and sentiment analysis [10], [11], [20]. There are two main challenges in document classification: insufficient label information [19] and the absence of an optimal representation method [12]. No systematic, automated process is available for assigning class labels to a large number of documents and simultaneously updating the classification model. In contrast, in many other classification tasks, the class label of a newly obtained example is determined automatically. For example, in customer churn classification in the telecommunication industry, the class labels, i.e., to stay or to leave, can be determined automatically because the customer status is updated periodically [1]. Similarly, in daily stock market prediction in the financial industry, the fluctuating price of an equity, which serves as the class label, is determined automatically when the market closes that day. For documents, however, label assignment is labor intensive, time consuming, and costly. Moreover, because a document is a list of words of variable length, it must be transformed into a fixed-size numerical vector for further analysis. Although several document representation methods are available, such as term frequency–inverse document frequency (TF–IDF) [23] and the recently proposed neural network-based distributed representations [16], no single representation method outperforms all others across all text analytics tasks.
When only a few labeled examples are available but a large number of unlabeled examples also exist, one can employ semi-supervised learning (SSL) approaches to improve the classification performance [8]. SSL assumes that examples belonging to a single class are generated from a single distribution. Hence, although unlabeled examples cannot be used explicitly to help classification models learn the discrimination function, they can help estimate the data distribution of each class; this, in turn, yields an improved class boundary compared with one obtained from labeled examples alone. Several strategies realize the SSL concept in learning algorithms. Self-training (ST) first constructs a classifier using only the labeled examples and then classifies the unlabeled examples with the current classifier [24]. If the likelihood of an unlabeled example belonging to a class is sufficiently high, the example is added to the labeled dataset with the predicted class label. A new classifier is then trained on the extended labeled dataset, and this procedure is repeated until all the unlabeled examples are assigned to one of the available classes. In contrast, generative models attempt to estimate the underlying data-generating function from the labeled and unlabeled examples [9]. They can determine the most appropriate distribution parameters by maximizing the posterior probability of both the labeled and unlabeled examples. Graph-based SSL assumes that a given dataset can be expressed as a set of nodes (examples) and edges (relations between nodes, e.g., similarity or distance) [32]. Once the graph is constructed, the label information is propagated throughout the graph to assign appropriate class labels to the unlabeled examples.
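The self-training loop described above can be sketched in a few lines. The base learner, confidence threshold, and toy 1-D data below are illustrative assumptions, not part of the original method: we use a nearest-centroid classifier whose confidence score is the margin between the two closest class centroids.

```python
# Minimal self-training (ST) sketch; all names and data are illustrative.

def train_centroids(X, y):
    """Fit one centroid (mean) per class from labeled 1-D data."""
    centroids = {}
    for label in set(y):
        points = [x for x, c in zip(X, y) if c == label]
        centroids[label] = sum(points) / len(points)
    return centroids

def predict_with_confidence(centroids, x):
    """Return (label, confidence); confidence is the margin between
    the two nearest centroids (larger = more confident)."""
    dists = sorted((abs(x - c), label) for label, c in centroids.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin

def self_train(X_lab, y_lab, X_unlab, threshold=1.0):
    X_lab, y_lab, X_unlab = list(X_lab), list(y_lab), list(X_unlab)
    while X_unlab:
        model = train_centroids(X_lab, y_lab)
        # Pseudo-label only the sufficiently confident unlabeled examples.
        added = []
        for x in X_unlab:
            label, margin = predict_with_confidence(model, x)
            if margin >= threshold:
                added.append((x, label))
        if not added:          # no example passes the threshold: stop
            break
        for x, label in added:
            X_lab.append(x)
            y_lab.append(label)
            X_unlab.remove(x)
    return train_centroids(X_lab, y_lab)

model = self_train([0.0, 10.0], [0, 1], [1.0, 2.0, 8.0, 9.0])
```

After the loop, the centroids reflect both the original labels and the pseudo-labeled examples absorbed along the way.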
Among the SSL approaches, co-training is particularly noteworthy in that it considers data from multiple perspectives [5]. Its premise is that not all significant characteristics of data can be observed from a single view: some characteristics are conveniently captured by one view, whereas others are captured by another. Therefore, if the feature set of a given dataset can be split into two subsets, two classifiers are trained independently, each on one of the subsets. Let us call these classifiers Model A and Model B. The models evolve by teaching each other as follows: if Model A is highly confident about the prediction for an unlabeled example whereas Model B’s confidence is low, this example is added to the training set of Model B with the label predicted by Model A, and vice versa. With the aid of the other classification model, each model can learn characteristics of the dataset that it cannot learn on its own. The key indicator of the success of co-training is whether the features can be split into independent subsets. If the features of a given dataset originate from a single view, dividing the feature set does not improve the classification performance. In contrast, if the features are naturally generated from different views, such as the text description of an object (one view) and its image (another view), co-training can effectively exploit unlabeled examples to enhance the classification performance.
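Under the same kind of illustrative assumptions as before (a toy nearest-centroid learner on single-feature views, with the centroid margin as the confidence score), one co-training exchange round might look as follows; all names, thresholds, and data are hypothetical, not the paper's implementation.

```python
# Compact co-training sketch with two single-feature views.

def view_centroids(xs, ys):
    """Class means for one view of the labeled data."""
    return {c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c)
            for c in set(ys)}

def confident_label(model, x):
    """Return (label, margin); margin = gap between the two nearest centroids."""
    d = sorted((abs(x - m), c) for c, m in model.items())
    return d[0][1], d[1][0] - d[0][0]

def co_train(lab_a, lab_b, labels, unlab_a, unlab_b, tau=3.0, rounds=5):
    """unlab_a/unlab_b are the two views of the same unlabeled examples,
    index-aligned. A confident model teaches the other model."""
    a_x, a_y = list(lab_a), list(labels)    # Model A's training pool
    b_x, b_y = list(lab_b), list(labels)    # Model B's training pool
    for _ in range(rounds):
        ma, mb = view_centroids(a_x, a_y), view_centroids(b_x, b_y)
        keep = []
        for i, (xa, xb) in enumerate(zip(unlab_a, unlab_b)):
            la, margin_a = confident_label(ma, xa)
            lb, margin_b = confident_label(mb, xb)
            if margin_a >= tau:              # A teaches B
                b_x.append(xb); b_y.append(la)
            elif margin_b >= tau:            # B teaches A
                a_x.append(xa); a_y.append(lb)
            else:
                keep.append(i)               # still unlabeled
        unlab_a = [unlab_a[i] for i in keep]
        unlab_b = [unlab_b[i] for i in keep]
        if not unlab_a:
            break
    return view_centroids(a_x, a_y), view_centroids(b_x, b_y)

model_a, model_b = co_train([0.0, 10.0], [0.0, 10.0], [0, 1],
                            [1.0, 9.0], [2.0, 8.0], tau=3.0)
```

In this toy run, Model A is confident about both unlabeled examples, so their view-B features enter Model B's pool with A's predicted labels.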
Inspired by the manner in which co-training algorithms are trained, and because different document representation methods have demonstrated their effectiveness in various text-mining tasks, we propose a multi-co-training (MCT) method for document classification. In our method, documents are expressed using three representation schemes: TF–IDF, latent Dirichlet allocation (LDA) [4], and document to vector (Doc2Vec) [16]. TF–IDF representation is based on the bag-of-words philosophy, which assumes that a document is simply a collection of words; thus, the document can be vectorized by computing the relative importance of each word, i.e., by considering the word’s frequency in the document and its popularity in the corpus. LDA was originally developed for topic modeling, the main purpose of which is to discover latent themes that permeate the corpus. Once the LDA model is trained, two outputs are generated: the word distribution per topic and the topic distribution per document. The latter can be regarded as another document representation in which both word frequencies and semantic information (topic constitution) are considered. Doc2Vec, the newest of the three schemes, is an extension of the word-to-vector (Word2Vec) representation. In Word2Vec, each word is regarded as a single vector whose elements are real numbers. The underlying assumption is that the element values of a word are affected by those of the surrounding words. This assumption is encoded as a neural network structure, e.g., continuous bag-of-words or skip-gram, and the network weights are adjusted by learning from observed examples [17]. Doc2Vec extends Word2Vec from the word level to the document level [16]: each document has its own vector in the same space as the words, and thus the distributed representations of both words and documents are learned simultaneously.
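As a small illustration of the Word2Vec context assumption mentioned above, the sketch below enumerates the (target, context) pairs that a skip-gram model would be trained on; the sentence and window size are arbitrary examples, and the actual training (weight updates, negative sampling, and the Doc2Vec document vector) is beyond this sketch.

```python
# Enumerate skip-gram (target, context) training pairs for a tokenized
# sentence with a symmetric context window; data are illustrative.

def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs within +/- `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # skip the target itself
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["documents", "are", "lists", "of", "words"], window=1)
```

Each pair becomes one training example that nudges the target word's vector toward predicting (or being predicted by) its context word.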
Once the documents in a corpus are expressed using the three representation methods, we train three classification models, one per representation. As in co-training, a document for which the prediction confidence of one of the three models is sufficiently high is added to the training sets of the other two models with the confidently predicted label. To verify the proposed MCT method, we conduct experiments by varying the rate of labeled training examples, the representation dimensions, and the data-dependent parameters.
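A minimal sketch of one MCT iteration is given below, under the assumption of a toy 1-D base learner per view (class means, with the centroid margin as the confidence score); the learner, threshold, and data are hypothetical, but the exchange rule follows the description above: the most confident model teaches the other two.

```python
# Toy per-view base learner: class means plus a margin-based confidence.
def fit(X, y):
    return {c: sum(x for x, t in zip(X, y) if t == c) / y.count(c)
            for c in set(y)}

def predict(model, x):
    return min(model, key=lambda c: abs(x - model[c]))

def confidence(model, x):
    d = sorted(abs(x - m) for m in model.values())
    return d[1] - d[0] if len(d) > 1 else float("inf")

def mct_round(pools, unlabeled, tau=0.9):
    """pools[v] = (X_v, y_v) for view v; unlabeled[i][v] = view v of doc i."""
    models = [fit(X, y) for X, y in pools]
    still_unlabeled = []
    for doc in unlabeled:
        # Find the view whose model is most confident about this document.
        scores = [(confidence(m, doc[v]), v) for v, m in enumerate(models)]
        best_conf, best_v = max(scores)
        if best_conf >= tau:
            label = predict(models[best_v], doc[best_v])
            # Teach the OTHER two models with the confident pseudo-label.
            for v in range(3):
                if v != best_v:
                    pools[v][0].append(doc[v])
                    pools[v][1].append(label)
        else:
            still_unlabeled.append(doc)
    return pools, still_unlabeled

# Three views (e.g., TF-IDF, LDA, Doc2Vec), reduced to one toy feature each.
pools = [([0.0, 10.0], [0, 1]) for _ in range(3)]
pools, rest = mct_round(pools, [[1.0, 2.0, 1.5]], tau=3.0)
```

Here the first view is most confident, so the document's second and third views are added, with the first view's predicted label, to the other two training pools.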
The rest of this paper is organized as follows: In Section 2, we briefly review previous studies on document classification using the three document representation methods (TF–IDF, LDA, and Doc2Vec) and their variants. In Section 3, the proposed MCT method is described. In Section 4, we explain the experimental design, including the data description, parameter settings, benchmarked methods, and performance measure. The experimental results are discussed in Section 5. Finally, in Section 6, we conclude the work and suggest a few future research directions.
Literature review
As document classification is one of the main text mining tasks, a large number of related studies have exhibited significant progress to date. In this study, we briefly review a few representative studies while focusing on document representation methods.
To date, TF–IDF has been the most commonly adopted document representation method for various document-processing tasks. It assigns each word in a document a weight according to two criteria: (1) the frequency of its usage in the document and (2) its rarity across the entire corpus.
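For concreteness, a minimal TF–IDF computation over a toy corpus is sketched below, using the common log-scaled inverse document frequency; real implementations differ in smoothing and normalization choices, so this is one illustrative variant rather than the formulation used in the paper.

```python
import math

def tf_idf(corpus):
    """corpus: list of tokenized documents -> list of {term: weight}."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in corpus:
        vec = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)      # term frequency in the doc
            idf = math.log(n_docs / df[term])    # rarity across the corpus
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors

docs = [["spam", "filter", "spam"], ["topic", "model"], ["spam", "topic"]]
vecs = tf_idf(docs)
```

Note how "filter", which occurs in only one document, outweighs "spam" in the first document even though "spam" is more frequent there: rarity across the corpus dominates.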
Method: Multi-co-training
The proposed MCT method is illustrated in Fig. 1. Each document is converted into three numerical vectors (three feature sets) based on the three document representation methods: TF–IDF, LDA, and Doc2Vec. Then, three learning schemes are applied: supervised learning (SL), ST, and MCT. SL-based algorithms use only labeled documents, whereas ST-based algorithms also use unlabeled data although they are trained on only one of the three feature sets. In contrast, in MCT, each model is initially trained on its own feature set and subsequently refined by exchanging confidently labeled examples with the other two models.
Experiments
To verify the proposed MCT method, we conducted a comparative experiment as described in Fig. 5. The same text-preprocessing techniques are applied to the five selected datasets, and each document is transformed into three sets of vectors based on TF–IDF, LDA, and Doc2Vec. Based on the three document representation methods, four learning strategies are employed: pure SL, ST-based SSL, co-training, and the proposed MCT. Consequently, 10 strategies are compared in total.
Result
Table 5 summarizes the number of feature dimensions optimized by 10-fold cross-validation. Generally, TF–IDF requires the highest dimension, followed by LDA and Doc2Vec. This result is straightforward: although TF–IDF selects the most significant terms for the classification tasks, its representation is sparser than those of the other two methods, as numerous feature values can be zero. Meanwhile, LDA and Doc2Vec learn distributed representations and therefore yield dense, lower-dimensional vectors.
Conclusion
In this paper, we propose MCT for improving document classification accuracy. Three document representation methods, i.e., TF–IDF, LDA, and Doc2Vec, are employed to transform an unstructured document into a real-valued vector. Because real-world problems contain far more unlabeled text documents than labeled ones, we adopt an SSL scheme that updates the classification model by using the available albeit unlabeled data. The experimental results verify that the proposed MCT can improve classification performance over the benchmark methods.
Acknowledgments
This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2016R1D1A1B03930729) and Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korean Government (MSIP) (No. 2017-0-00349), Development of Media Streaming system with Machine Learning using QoE (Quality of Experience). This work was also supported by Korea Electric Power Corporation. (Grant number:
References (32)
- et al., Support vector machine active learning with applications to text classification, J. Mach. Learn. Res. (2001)
- et al., Customer churn prediction in telecommunication industry: with and without counter-example, 2014 European Network Intelligence Conference (2014)
- et al., Random forest classifier for multi-category classification of web pages, IEEE Asia-Pacific Services Computing Conference (APSCC) (2009)
- et al., Latent Dirichlet allocation in web spam filtering, Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (2008)
- et al., Latent Dirichlet allocation, J. Mach. Learn. Res. (2003)
- et al., Combining labeled and unlabeled data with co-training, Proceedings of the Eleventh Annual Conference on Computational Learning Theory (1998)
- et al., Automatic document classification, J. ACM (1963)
- et al., A stream-based semi-supervised active learning approach for document classification, 12th International Conference on Document Analysis and Recognition (ICDAR) (2013)
- et al., Semi-supervised learning (2010)
- et al., Semi-supervised classification with hybrid generative/discriminative methods, Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2007)
- Domain adaptation for large-scale sentiment classification: a deep learning approach, Proceedings of the 28th International Conference on Machine Learning (ICML-11)
- Twitter sentiment classification using distant supervision, CS224N Project Report, Stanford
- Representation and classification of text documents: a brief review, IJCA, Special Issue on RTIPPR
- A review of machine learning algorithms for text-documents classification, J. Adv. Inf. Technol.
- Some effective techniques for naive Bayes text classification, IEEE Trans. Knowl. Data Eng.