Information Sciences

Volume 477, March 2019, Pages 15-29

Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec

https://doi.org/10.1016/j.ins.2018.10.006

Abstract

The purpose of document classification is to assign the most appropriate label to a specified document. The main challenges in document classification are insufficient label information and unstructured sparse format. A semi-supervised learning (SSL) approach could be an effective solution to the former problem, whereas the consideration of multiple document representation schemes can resolve the latter problem. Co-training is a popular SSL method that attempts to exploit various perspectives in terms of feature subsets for the same example. In this paper, we propose multi-co-training (MCT) for improving the performance of document classification. In order to increase the variety of feature sets for classification, we transform a document using three document representation methods: term frequency–inverse document frequency (TF–IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and neural-network-based document embedding known as document to vector (Doc2Vec). The experimental results demonstrate that the proposed MCT is robust to parameter changes and outperforms benchmark methods under various conditions.

Introduction

Document classification is one of the main tasks of text mining and has been used in several applications [6], such as spam filtering [3] and sentiment analysis [10], [11], [20]. There are two main challenges in document classification: insufficient label information [19] and the absence of an optimal representation method [12]. For document classification, no systematic, automated process is available that can assign class labels to a large number of documents and simultaneously update the classification model. In contrast, in many other classification tasks, the class label of a new example is determined automatically once the example is obtained. For instance, in customer churn classification in the telecommunication industry, the class labels, i.e., to stay or to leave, are determined automatically because the customer status is updated periodically [1]. Similarly, in daily stock market prediction in the financial industry, the fluctuating price of an equity, which serves as the class label for the task, is determined automatically when the market closes that day. Label assignment for documents, however, is labor-intensive, time-consuming, and costly. Moreover, because a document is a variable-length list of words, it must be transformed into a fixed-size numerical vector for further analysis. Although various document representation methods are available, such as term frequency–inverse document frequency (TF–IDF) [23] and the recently proposed neural-network-based distributed representations [16], no single representation method outperforms all others across all text-analytics tasks.

When only a few labeled examples are available alongside a large number of unlabeled examples, semi-supervised learning (SSL) approaches can be employed to improve classification performance [8]. SSL assumes that examples belonging to a single class are generated from a single distribution. Hence, although unlabeled examples cannot be used directly to learn the discrimination function, they can help estimate the data distribution of each class; this, in turn, yields a better class boundary than one obtained from the labeled examples alone. Several strategies realize the SSL concept in learning algorithms. Self-training (ST) first constructs a classifier using only the labeled examples and then classifies the unlabeled examples with the current classifier [24]. If the predicted likelihood of an unlabeled example for a class is sufficiently high, the example is added to the labeled dataset with the predicted class label. A new classifier is then trained on the extended labeled dataset, and the procedure is repeated until all unlabeled examples are assigned to one of the available classes. In contrast, generative models attempt to estimate the underlying data-generation function from both the labeled and unlabeled examples [9]; they determine the most appropriate distribution parameters by maximizing the posterior probability of both sets of examples. Graph-based SSL assumes that a dataset can be expressed as a set of nodes (examples) and edges (relations between nodes, e.g., similarity or distance) [32]. Once the graph is constructed, label information is propagated through it to assign appropriate class labels to the unlabeled examples.
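
To make the ST procedure concrete, the following is a minimal sketch (not the authors' implementation), assuming a scikit-learn-style classifier with fit/predict_proba, dense NumPy feature matrices, and an illustrative confidence threshold:

    import numpy as np

    def self_train(model, X_l, y_l, X_u, threshold=0.95, max_iter=20):
        """Sketch of self-training: repeatedly move confidently predicted
        unlabeled examples into the labeled pool and retrain."""
        X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
        for _ in range(max_iter):
            if len(X_u) == 0:
                break
            model.fit(X_l, y_l)
            proba = model.predict_proba(X_u)              # class probabilities
            confident = proba.max(axis=1) >= threshold    # high-likelihood examples
            if not confident.any():
                break
            pseudo = model.classes_[proba[confident].argmax(axis=1)]
            X_l = np.vstack([X_l, X_u[confident]])
            y_l = np.concatenate([y_l, pseudo])
            X_u = X_u[~confident]
        return model.fit(X_l, y_l)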

Among SSL approaches, co-training is noteworthy in that it considers the data from multiple perspectives [5]. Its premise is that not all significant characteristics of the data can be observed from a single view: some characteristics are easily captured by one view, whereas others are captured by another. Therefore, if the feature set of a dataset can be split into two subsets, two classifiers are trained independently, each on one subset. Suppose these classifiers are Model A and Model B. The two models evolve by teaching each other as follows: if Model A is highly confident about its prediction for an unlabeled example while Model B's confidence is low, the example is added to Model B's training set with the label predicted by Model A, and vice versa. With the aid of the other model, each model can learn characteristics of the dataset that it could not learn on its own; a sketch of this loop is given below. The key determinant of co-training's success is whether the features can be split into independent subsets. If the features originate from a single view, dividing the feature set does not improve classification performance. In contrast, if the features are naturally generated from different views, such as the text description of an object (one view) and its image (another view), co-training can effectively exploit unlabeled examples to enhance classification performance.
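
A hedged sketch of the two-view loop just described, again assuming scikit-learn-style classifiers and dense NumPy matrices; the threshold and round count are illustrative:

    import numpy as np

    def co_train(model_a, model_b, Xa, Xb, y, Xa_u, Xb_u, threshold=0.9, rounds=10):
        """Sketch of co-training: each model labels, for the other model,
        the unlabeled examples it predicts confidently."""
        pool_a, ya = Xa.copy(), y.copy()
        pool_b, yb = Xb.copy(), y.copy()
        for _ in range(rounds):
            if len(Xa_u) == 0:
                break
            model_a.fit(pool_a, ya)
            model_b.fit(pool_b, yb)
            pa = model_a.predict_proba(Xa_u)
            pb = model_b.predict_proba(Xb_u)
            ca, cb = pa.max(axis=1), pb.max(axis=1)
            teach_b = (ca >= threshold) & (cb < threshold)   # A teaches B
            teach_a = (cb >= threshold) & (ca < threshold)   # B teaches A
            if not (teach_a.any() or teach_b.any()):
                break
            pool_b = np.vstack([pool_b, Xb_u[teach_b]])
            yb = np.concatenate([yb, model_a.classes_[pa[teach_b].argmax(axis=1)]])
            pool_a = np.vstack([pool_a, Xa_u[teach_a]])
            ya = np.concatenate([ya, model_b.classes_[pb[teach_a].argmax(axis=1)]])
            keep = ~(teach_a | teach_b)
            Xa_u, Xb_u = Xa_u[keep], Xb_u[keep]
        model_a.fit(pool_a, ya)
        model_b.fit(pool_b, yb)
        return model_a, model_b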

Inspired by the manner in which co-training algorithms are trained, and because different document representation methods have demonstrated their effectiveness in various text-mining tasks, we propose a multi-co-training (MCT) method for document classification. In our method, documents are expressed using three representation schemes: TF–IDF, latent Dirichlet allocation (LDA) [4], and document to vector (Doc2Vec) [16]. TF–IDF representation is based on the bag-of-words philosophy, which assumes that a document is simply a collection of words; the document can thus be vectorized by computing the relative importance of each word, i.e., by considering the word's frequency in the document and its popularity in the corpus. LDA was originally developed for topic modeling, the main purpose of which is to discover latent themes that permeate a corpus. A trained LDA model yields two outputs: a word distribution per topic and a topic distribution per document. The latter can be regarded as another document representation, in which both word frequencies and semantic information (topic constitution) are considered. Doc2Vec, the newest of the three schemes, is an extension of the word-to-vector (Word2Vec) representation. In Word2Vec, a word is regarded as a single vector whose elements are real numbers, under the assumption that a word's element values are affected by those of the surrounding words. This assumption is encoded as a neural network structure, e.g., continuous bag-of-words or skip-gram, and the network weights are adjusted by learning from observed examples [17]. Doc2Vec extends Word2Vec from the word level to the document level [16]: each document has its own vector in the same space as the words, so the distributed representations of words and documents are learned simultaneously. Once the documents in a corpus are expressed using the three representation methods, we train three classification models, one per representation. As in co-training, a document that one of the three models predicts with sufficiently high confidence is added to the training sets of the other two models with the confidently predicted label. To verify the proposed MCT method, we conduct experiments by varying the rate of labeled training examples, the representation dimensions, and the data-dependent parameters.
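
As an illustration of how the three views could be computed, a sketch using scikit-learn and gensim (version 4 or later); the toy corpus, dimensions, and hyperparameters are placeholders, not the tuned values reported later in the paper:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = ["good movie great plot", "market price fell sharply", "spam offer click now"]
    tokens = [d.split() for d in docs]

    # View 1: TF-IDF (sparse bag-of-words weights).
    X_tfidf = TfidfVectorizer().fit_transform(docs)

    # View 2: per-document topic distribution from LDA.
    dictionary = Dictionary(tokens)
    bow = [dictionary.doc2bow(t) for t in tokens]
    lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=2, random_state=0)
    X_lda = np.array([[p for _, p in lda.get_document_topics(b, minimum_probability=0.0)]
                      for b in bow])

    # View 3: Doc2Vec document embeddings.
    tagged = [TaggedDocument(t, [str(i)]) for i, t in enumerate(tokens)]
    d2v = Doc2Vec(tagged, vector_size=10, min_count=1, epochs=40)
    X_d2v = np.array([d2v.dv[str(i)] for i in range(len(tokens))])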

The rest of this paper is organized as follows: In Section 2, we briefly review previous studies on document classification using the three document representation methods (TF–IDF, LDA, and Doc2Vec) and their variants. In Section 3, the proposed MCT method is described. In Section 4, we explain the experimental design, including the data description, parameter settings, benchmark methods, and performance measure. The experimental results are discussed in Section 5. Finally, in Section 6, we conclude with a few directions for future research.

Literature review

As document classification is one of the main text-mining tasks, significant progress has been made in a large number of related studies. In this section, we briefly review a few representative studies, focusing on document representation methods.

To date, TF–IDF has been the most commonly adopted document representation method for various document-processing tasks. It assigns each word in a document a weight according to two criteria: (1) the frequency of its usage in the document (term frequency) and (2) its rarity across the corpus (inverse document frequency).
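
For reference, one common form of the weight of term t in document d (several variants exist; [23] gives the original formulation), where tf_{t,d} is the in-document frequency of t, df_t is the number of documents containing t, and N is the number of documents in the corpus:

    w_{t,d} = \mathrm{tf}_{t,d} \times \log\left(\frac{N}{\mathrm{df}_t}\right)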

Method: Multi-co-training

The proposed MCT method is illustrated in Fig. 1. Each document is converted into three numerical vectors (three feature sets) based on the three document representation methods: TF–IDF, LDA, and Doc2Vec. Then, three learning schemes are applied: supervised learning (SL), ST, and MCT. SL-based algorithms use only the labeled documents, whereas ST-based algorithms also use unlabeled data, although each is trained on only one of the three feature sets. In contrast, in MCT, each model is initially trained on its own feature set and is subsequently augmented with documents confidently labeled by the models built on the other two feature sets; a schematic sketch of one such round follows.
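
The following sketch shows one MCT round under stated assumptions (scikit-learn-style classifiers, dense matrices, a single confidence threshold); it illustrates the idea in Fig. 1 rather than reproducing the authors' exact procedure:

    import numpy as np

    VIEWS = ("tfidf", "lda", "doc2vec")

    def mct_round(models, X_l, y_l, X_u, threshold=0.9):
        """One MCT round: a document predicted with high confidence by one
        view's model is added, with that pseudo-label, to the training
        pools of the other two views."""
        proba = {}
        for v in VIEWS:
            models[v].fit(X_l[v], y_l[v])
            proba[v] = models[v].predict_proba(X_u[v])
        used = np.zeros(len(X_u[VIEWS[0]]), dtype=bool)
        for v in VIEWS:
            pick = (proba[v].max(axis=1) >= threshold) & ~used  # teach each doc once
            if not pick.any():
                continue
            pseudo = models[v].classes_[proba[v].argmax(axis=1)][pick]
            for other in VIEWS:
                if other != v:
                    X_l[other] = np.vstack([X_l[other], X_u[other][pick]])
                    y_l[other] = np.concatenate([y_l[other], pseudo])
            used |= pick
        for v in VIEWS:
            X_u[v] = X_u[v][~used]
        return models, X_l, y_l, X_u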

Experiments

To verify the proposed MCT method, we conduct a comparative experiment as described in Fig. 5. The same text-preprocessing techniques are applied to the five selected datasets. Each document in the datasets is transformed into three sets of vectors based on TF–IDF, LDA, and Doc2Vec. Based on the three document representation methods, four learning strategies are employed: pure SL, ST-based SSL, co-training, and the proposed MCT. Consequently, 10 strategies are compared in total.

Results

Table 5 summarizes the number of feature dimensions optimized by 10-fold cross-validation. Generally, TF–IDF requires the highest dimension, followed by LDA and Doc2Vec. This result is intuitive: although TF–IDF selects the terms most significant for the classification tasks, its representation is sparser than those of the other two document representation methods, i.e., numerous feature values can be zero in TF–IDF. Meanwhile, LDA and Doc2Vec learn distributed representations, which are dense and can therefore encode the documents in fewer dimensions.
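
A sketch of how such a dimension search could look with 10-fold cross-validation; the base learner (logistic regression) and the candidate grid are illustrative, not the paper's settings:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def best_dimension(build_features, y, dims=(50, 100, 200, 400)):
        """Return the dimension with the highest mean 10-fold CV accuracy.
        build_features(dim) must return the feature matrix of the labeled
        documents at that dimensionality (e.g., a Doc2Vec vector_size)."""
        scores = {}
        for d in dims:
            X = build_features(d)
            scores[d] = cross_val_score(LogisticRegression(max_iter=1000),
                                        X, y, cv=10).mean()
        return max(scores, key=scores.get)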

Conclusion

In this paper, we propose MCT for improving document classification accuracy. Three document representation methods, i.e., TF–IDF, LDA, and Doc2Vec, are employed to transform an unstructured document into a real-valued vector. As real-world problems present far more unlabeled than labeled text documents, we adopt an SSL scheme that refines the classification model using the available albeit unlabeled data. The experimental results verify that the proposed MCT is robust to parameter changes and can outperform benchmark methods under various conditions.

Acknowledgments

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2016R1D1A1B03930729) and by an Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korean Government (MSIP) (No. 2017-0-00349, Development of Media Streaming System with Machine Learning Using QoE (Quality of Experience)). This work was also supported by Korea Electric Power Corporation.

References

  • S. Tong et al., Support vector machine active learning with applications to text classification, J. Mach. Learn. Res. (2001).
  • A. Amin et al., Customer churn prediction in telecommunication industry: with and without counter-example, 2014 European Network Intelligence Conference (2014).
  • W.T. Aung et al., Random forest classifier for multi-category classification of web pages, IEEE Asia-Pacific Services Computing Conference (APSCC 2009) (2009).
  • I. Bíró et al., Latent Dirichlet allocation in web spam filtering, Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (2008).
  • D.M. Blei et al., Latent Dirichlet allocation, J. Mach. Learn. Res. (2003).
  • A. Blum et al., Combining labeled and unlabeled data with co-training, Proceedings of the Eleventh Annual Conference on Computational Learning Theory (1998).
  • H. Borko et al., Automatic document classification, J. ACM (1963).
  • M.-R. Bouguelia et al., A stream-based semi-supervised active learning approach for document classification, 12th International Conference on Document Analysis and Recognition (ICDAR) (2013).
  • O. Chapelle et al., Semi-Supervised Learning (2010).
  • G. Druck et al., Semi-supervised classification with hybrid generative/discriminative methods, Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2007).
  • X. Glorot et al., Domain adaptation for large-scale sentiment classification: a deep learning approach, Proceedings of the 28th International Conference on Machine Learning (ICML-11) (2011).
  • A. Go et al., Twitter sentiment classification using distant supervision, CS224N Project Report, Stanford (2009).
  • B.S. Harish et al., Representation and classification of text documents: a brief review, IJCA, Special Issue on RTIPPR (2010).
  • A. Khan et al., A review of machine learning algorithms for text-documents classification, J. Adv. Inf. Technol. (2010).
  • S.-B. Kim et al., Some effective techniques for naive Bayes text classification, IEEE Trans. Knowl. Data Eng. (2006).
  • J.H. Lau, T. Baldwin, An empirical evaluation of doc2vec with practical insights into document embedding generation, ...