Exploiting probabilistic topic models to improve text categorization under class imbalance

https://doi.org/10.1016/j.ipm.2010.07.003

Abstract

In text categorization, the numbers of documents in different categories often differ substantially, i.e., the class distribution is imbalanced. We propose a unique approach to improve text categorization under class imbalance by exploiting the semantic context in text documents. Specifically, we generate new samples for rare classes (categories with relatively small amounts of training data) using the global semantic information of each class, represented by probabilistic topic models. In this way, the numbers of samples in different categories become more balanced, and the performance of text categorization can be improved on the transformed data set. The proposed method differs from traditional re-sampling methods, which try to balance the number of documents in different classes by re-sampling the documents in rare classes and can therefore cause overfitting. Another benefit of our approach is the effective handling of noisy samples: since all the new samples are generated by topic models, the impact of noisy samples is dramatically reduced. Finally, as demonstrated by the experimental results, the proposed methods achieve better performance under class imbalance and are more tolerant to noisy samples.

Research highlights

► Propose two re-sampling methods based on probabilistic topic models.
► Improve text categorization under class imbalance.
► DECOM and DECODER achieve better performance under class imbalance.
► DECODER is more tolerant to noisy samples.

Introduction

Text categorization/classification (Sebastiani, 2002) is a common technique for automatically organizing documents into predefined categories. While existing text categorization techniques have shown promising results in many application scenarios (e.g., Sebastiani, 2002; Ko & Seo, 2009; Paradis & Nie, 2007), handling data sets with imbalanced class distributions remains a challenging research issue.

In the case of class imbalance, classifiers tend to ignore rare classes in favor of larger classes due to the size effect. Indeed, Yang and Liu (1999) compared the robustness and classification performance of several text categorization methods, such as the Support Vector Machine (SVM), the Naive Bayes classifier, and the K-Nearest Neighbor (KNN) classifier, on data sets with various class distributions, and their experimental results show that all these classifiers achieve relatively low performance on rare classes. A promising direction for handling the class imbalance problem is to apply re-sampling techniques. Specifically, over-sampling techniques (Japkowicz, 2000) can be used to increase the number of data instances in rare classes, and under-sampling techniques can be used to reduce the number of data instances in large classes (Japkowicz & Stephen, 2002). The ultimate goal is to adjust the sizes of the classes to a relatively balanced level, as the sketch below illustrates.
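As a concrete illustration (ours, not the paper's), a minimal Python sketch of random over- and under-sampling might look as follows; the function name and signature are hypothetical.

```python
import random
from collections import defaultdict

def rebalance(docs, labels, target_size, seed=0):
    """Randomly under-sample large classes and over-sample rare classes
    (with replacement) so every class ends up with target_size documents."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_class[label].append(doc)
    new_docs, new_labels = [], []
    for label, members in by_class.items():
        if len(members) >= target_size:
            chosen = rng.sample(members, target_size)                   # under-sample
        else:
            chosen = [rng.choice(members) for _ in range(target_size)]  # over-sample
        new_docs.extend(chosen)
        new_labels.extend([label] * len(chosen))
    return new_docs, new_labels
```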

Although both over-sampling and under-sampling can alleviate the class imbalance problem, they have side effects. For instance, replicating samples by over-sampling can result in overfitting, and useful samples in large classes may be lost through under-sampling, which in turn hinders classification performance. Therefore, many modified re-sampling methods have been developed to overcome these disadvantages. For example, the overfitting problem of random over-sampling can be avoided to a certain extent by adding random Gaussian noise to samples in rare classes or by synthetically generating samples for rare classes (Chawla, Bowyer, Hall, & Kegelmeyer, 2002). Likewise, better performance than random under-sampling can be achieved by eliminating only those samples that are further away from class boundaries (Batista, Prati, & Monard, 2004).
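To make the synthetic-generation idea concrete, here is a hedged, SMOTE-style sketch in the spirit of Chawla et al. (2002): each new feature vector interpolates a rare-class point toward one of its nearest neighbors. This is an illustrative simplification, not the paper's method, and the function name is ours.

```python
import numpy as np

def smote_like(X_rare, n_new, k=5, seed=0):
    """Generate n_new synthetic vectors for a rare class by interpolating
    random points toward their k nearest neighbors (SMOTE-style).
    X_rare: (n, d) feature matrix with n >= 2."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_rare, dtype=float)
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                   # a point is not its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :min(k, n - 1)]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                        # pick a random rare-class point
        j = neighbors[i, rng.integers(neighbors.shape[1])]  # one of its neighbors
        lam = rng.random()                         # interpolation factor in [0, 1)
        synthetic.append(X[i] + lam * (X[j] - X[i]))
    return np.stack(synthetic)
```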

As a matter of fact, most re-sampling techniques were developed for general types of data rather than for text. Our approach exploits the semantic characteristics unique to text documents to improve text categorization under class imbalance. New samples for rare classes are generated using the global semantic information of each class, represented by probabilistic topic models, instead of by replicating samples in rare classes. Probabilistic topic models (Blei et al., 2003; Griffiths & Steyvers, 2004) are effective tools for capturing semantic topics (see Section 2.1 for details). By exploiting the probabilistic topic models of rare classes to generate new samples, overfitting is less likely to occur. In addition, if the original training samples are replaced by new samples generated from probabilistic topic models, the training data are smoothed and the impact of noisy samples on classification performance is also alleviated.
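The generative step behind this idea can be pictured with a small sketch. Assuming a fitted topic model with a class-level topic mixture theta and per-topic word distributions phi (our notation; the paper's model is detailed in Section 2.1), a pseudo-document is produced by repeatedly drawing a topic and then a word from that topic:

```python
import numpy as np

def generate_document(theta, phi, vocab, length, seed=0):
    """Sample one pseudo-document of `length` words from a fitted topic model.
    theta: (K,) topic mixture; phi: (K, V) per-topic word distributions;
    vocab: list of V words."""
    rng = np.random.default_rng(seed)
    words = []
    for _ in range(length):
        z = rng.choice(len(theta), p=theta)   # draw a topic  z ~ Multinomial(theta)
        w = rng.choice(len(vocab), p=phi[z])  # draw a word   w ~ Multinomial(phi_z)
        words.append(vocab[w])
    return " ".join(words)
```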

In this paper, we propose two re-sampling methods based on probabilistic topic models: DECOM (data re-sampling with probabilistic topic models) and DECODER (data re-sampling with probabilistic topic models after smoothing). DECOM deals with class imbalance by generating new samples for rare classes using probabilistic topic models. In addition, for data sets with a high proportion of noisy samples, DECODER first smooths the data by regenerating all samples in the data set and then generates more samples for rare classes using probabilistic topic models. Experimental results on various real-world data sets show that both DECOM and DECODER achieve better classification performance on rare classes, and that DECODER is more robust in handling very noisy data. A rough pipeline sketch follows.
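Reading the two methods at this level of detail, a heavily simplified sketch of the DECOM balancing loop might look as follows; fit_topic_model and generate_doc stand for assumed helpers such as the sketches above, and the paper's actual algorithm (Section 2) may differ.

```python
def decom_balance(class_docs, fit_topic_model, generate_doc, target_size):
    """Rough sketch of the DECOM idea: for each rare class, fit a topic
    model on that class's documents, then generate synthetic documents
    until the class reaches target_size."""
    balanced = {}
    for label, docs in class_docs.items():
        docs = list(docs)
        if len(docs) < target_size:                  # rare class
            model = fit_topic_model(docs)            # class-level topic model
            docs += [generate_doc(model) for _ in range(target_size - len(docs))]
        balanced[label] = docs
    return balanced
```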

Overview: The remainder of this paper is organized as follows. Section 2 introduces probabilistic topic models as well as two re-sampling methods based on probabilistic topic models. In Section 3, we present experimental results on a number of real-world document data sets. Section 4 describes some related work. Finally, we draw conclusions in Section 5.

Section snippets

Re-sampling with probabilistic topic models

In this section, we first present a probabilistic topic model based on Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Then we propose two re-sampling methods: DECOM and DECODER.
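For readers who want to experiment, a topic model can be fitted and its distributions recovered with scikit-learn's variational LDA; this is a convenient stand-in only, whereas the paper builds on LDA estimated with Gibbs sampling (Griffiths & Steyvers, 2004). The toy corpus below is ours.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stocks fell sharply today", "the team won the match",
        "markets rallied on earnings", "players scored in the final"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)                 # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)                       # (n_docs, K) topic mixtures
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # (K, V) word dists
vocab = vectorizer.get_feature_names_out()         # maps column indices back to words
```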

Experimental evaluations

In this section, we evaluate the effectiveness of the proposed DECOM and DECODER re-sampling methods in two stages. In the first stage (Sections 3.2.1 and 3.2.2, which report results on two-class and multi-class data sets, respectively), we use several imbalanced real-world data sets to show the effectiveness of DECOM and DECODER in improving classification performance on rare classes. In the second stage, we evaluate the performance of DECODER on very noisy data.
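Under class imbalance, overall accuracy can mask poor performance on rare classes, so per-class metrics are the natural yardstick. As an illustrative snippet with toy labels (the paper's exact protocol is described in Section 3), per-class F1 scores and the macro average can be computed as follows:

```python
from sklearn.metrics import classification_report, f1_score

y_true = ["rare", "big", "big", "big", "rare", "big"]   # toy gold labels
y_pred = ["big",  "big", "big", "big", "rare", "big"]   # toy predictions

print(classification_report(y_true, y_pred))            # per-class precision/recall/F1
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))  # weights classes equally
```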

Related work

The class imbalance problem is a major research issue, and researchers have addressed this problem from various perspectives.

First, there is research focusing on understanding the class imbalance problem. For example, it has been shown (Japkowicz & Stephen, 2002; Wu et al., 2007) that the imbalance problem is related not only to the degree of class imbalance in the data sets, but also to the overall size of the training data as well as to the complexity of the concepts in the data.

Conclusion

In this paper, we have proposed a semantic re-sampling approach to handle the class imbalance problem in text categorization. Specifically, two re-sampling techniques, DECOM and DECODER, were developed based on probabilistic topic models. DECOM was proposed to deal with class imbalance by generating new samples for rare classes. For data sets with noisy samples and rare classes, DECODER was developed to smooth the data by regenerating all samples in each class using probabilistic topic models and then generating additional samples for rare classes.

Acknowledgements

The authors wish to thank the reviewers for their invaluable comments. The work described in this paper was supported by grants from the Natural Science Foundation of China (Grant No. 60775037), the Key Program of the National Natural Science Foundation of China (Grant No. 60933013), the National High Technology Research and Development Program of China (Grant No. 2009AA01Z123), and the Research Fund for the Doctoral Program of Higher Education of China (Grant No. 20093402110017).

References (29)

  • Y. Ko et al. (2009). Text classification from unlabeled documents with bootstrapping and feature projection techniques. Information Processing and Management.
  • F. Paradis et al. (2007). Contextual feature selection for text classification. Information Processing and Management.
  • S. Tan (2005). Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Systems with Applications.
  • G.E.A.P.A. Batista et al. (2004). A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations Newsletter.
  • D.M. Blei et al. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research.
  • Brank, J., & Grobelnik, M. (2003). Training text classifiers with SVM on very few positive examples. Tech. Rep. ...
  • G. Casella et al. (1992). Explaining the Gibbs sampler. The American Statistician.
  • Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: A library for support vector machines. ...
  • N.V. Chawla et al. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research.
  • N. Cristianini et al. (2000). An introduction to support vector machines and other kernel-based learning methods.
  • M. DeGroot et al. (2001). Probability and statistics.
  • T.L. Griffiths et al. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America.
  • Han, E.-H., Boley, D., Gini, M., Gross, R., Hastings, K., Karypis, G., et al. (1998). WebACE: A web agent for document ...
  • P.E. Hart (1968). The condensed nearest neighbor rule. IEEE Transactions on Information Theory.