Exploiting probabilistic topic models to improve text categorization under class imbalance

https://doi.org/10.1016/j.ipm.2010.07.003

Abstract

In text categorization, the numbers of documents in different categories often differ substantially, i.e., the class distribution is imbalanced. We propose a unique approach to improve text categorization under class imbalance by exploiting the semantic context in text documents. Specifically, we generate new samples for rare classes (categories with relatively small amounts of training data) using the global semantic information of each class, represented by probabilistic topic models. In this way, the numbers of samples in different categories become more balanced, and the performance of text categorization can be improved on the transformed data set. The proposed method differs from traditional re-sampling methods, which try to balance the number of documents in different classes by re-sampling the documents in rare classes and can therefore cause overfitting. Another benefit of our approach is the effective handling of noisy samples: since all the new samples are generated by topic models, the impact of noisy samples is dramatically reduced. Finally, as demonstrated by the experimental results, the proposed methods achieve better performance under class imbalance and are more tolerant to noisy samples.

Research highlights

► Propose two re-sampling methods based on probabilistic topic models.
► Improve text categorization under class imbalance.
► DECOM and DECODER achieve better performance under class imbalance.
► DECODER is more tolerant to noisy samples.

Introduction

Text categorization/classification (Sebastiani, 2002) is a common technique for automatically organizing documents into predefined categories. While existing text categorization techniques have shown promising results in many application scenarios (e.g., Sebastiani, 2002; Ko & Seo, 2009; Paradis & Nie, 2007), handling data sets with imbalanced class distributions remains a challenging research issue.

In the case of class imbalance, classifiers tend to ignore rare classes in favor of larger classes due to the size effect. Indeed, Yang and Liu (1999) compared the robustness and classification performance of several text categorization methods, such as the Support Vector Machine (SVM), the Naive Bayes classifier, and the K-Nearest Neighbor (KNN) classifier, on data sets with various class distributions, and their experimental results show that all these classifiers achieve relatively low performance on rare classes. A promising direction for handling the class imbalance problem is to apply re-sampling techniques. Specifically, over-sampling techniques (Japkowicz, 2000) can be used to increase the number of data instances in rare classes, and under-sampling techniques can be used to reduce the number of data instances in large classes (Japkowicz & Stephen, 2002). The ultimate goal is to adjust the sizes of the classes to a relatively balanced level, as the sketch below illustrates.
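As a concrete illustration (ours, not the paper's), a minimal Python sketch of random over- and under-sampling might look as follows; the function name and signature are hypothetical.

```python
import random
from collections import defaultdict

def rebalance(docs, labels, target_size, seed=0):
    """Randomly under-sample large classes and over-sample rare classes
    (with replacement) so every class ends up with target_size documents."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_class[label].append(doc)
    new_docs, new_labels = [], []
    for label, members in by_class.items():
        if len(members) >= target_size:
            chosen = rng.sample(members, target_size)                   # under-sample
        else:
            chosen = [rng.choice(members) for _ in range(target_size)]  # over-sample
        new_docs.extend(chosen)
        new_labels.extend([label] * len(chosen))
    return new_docs, new_labels
```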

Although both over-sampling and under-sampling can alleviate the class imbalance problem, they have side effects. For instance, replicating samples by over-sampling can result in overfitting, and useful samples in large classes may be lost through under-sampling, which in turn hinders classification performance. Therefore, many modified re-sampling methods have been developed to overcome these disadvantages. For example, the overfitting problem of random over-sampling can be avoided to a certain extent by adding random Gaussian noise to samples in rare classes or by synthetically generating samples for rare classes (Chawla, Bowyer, Hall, & Kegelmeyer, 2002). Likewise, better performance than random under-sampling can be achieved by eliminating only those samples that are further away from class boundaries (Batista, Prati, & Monard, 2004).
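To make the synthetic-generation idea concrete, here is a hedged, SMOTE-style sketch in the spirit of Chawla et al. (2002): each new feature vector interpolates a rare-class point toward one of its nearest neighbors. This is an illustrative simplification, not the paper's method, and the function name is ours.

```python
import numpy as np

def smote_like(X_rare, n_new, k=5, seed=0):
    """Generate n_new synthetic vectors for a rare class by interpolating
    random points toward their k nearest neighbors (SMOTE-style).
    X_rare: (n, d) feature matrix with n >= 2."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_rare, dtype=float)
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                   # a point is not its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :min(k, n - 1)]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                        # pick a random rare-class point
        j = neighbors[i, rng.integers(neighbors.shape[1])]  # one of its neighbors
        lam = rng.random()                         # interpolation factor in [0, 1)
        synthetic.append(X[i] + lam * (X[j] - X[i]))
    return np.stack(synthetic)
```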

As a matter of fact, most re-sampling techniques were developed for general types of data rather than for text. Our approach exploits the semantic characteristics unique to text documents to improve text categorization under class imbalance. New samples for rare classes are generated using the global semantic information of each class, represented by probabilistic topic models, instead of by replicating samples in rare classes. Probabilistic topic models (Blei et al., 2003; Griffiths & Steyvers, 2004) are effective tools for capturing semantic topics (see Section 2.1 for details). By exploiting the probabilistic topic models of rare classes to generate new samples, overfitting is less likely to occur. In addition, if the original training samples are replaced by new samples generated from probabilistic topic models, the training data are smoothed and the impact of noisy samples on classification performance is also alleviated.
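The generative step behind this idea can be pictured with a small sketch. Assuming a fitted topic model with a class-level topic mixture theta and per-topic word distributions phi (our notation; the paper's model is detailed in Section 2.1), a pseudo-document is produced by repeatedly drawing a topic and then a word from that topic:

```python
import numpy as np

def generate_document(theta, phi, vocab, length, seed=0):
    """Sample one pseudo-document of `length` words from a fitted topic model.
    theta: (K,) topic mixture; phi: (K, V) per-topic word distributions;
    vocab: list of V words."""
    rng = np.random.default_rng(seed)
    words = []
    for _ in range(length):
        z = rng.choice(len(theta), p=theta)   # draw a topic  z ~ Multinomial(theta)
        w = rng.choice(len(vocab), p=phi[z])  # draw a word   w ~ Multinomial(phi_z)
        words.append(vocab[w])
    return " ".join(words)
```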

In this paper, we propose two re-sampling methods based on probabilistic topic models: DECOM (data re-sampling with probabilistic topic models) and DECODER (data re-sampling with probabilistic topic models after smoothing). DECOM deals with class imbalance by generating new samples for rare classes using probabilistic topic models. In addition, for data sets with a high proportion of noisy samples, DECODER first smooths the data by regenerating all samples in the data set and then generates more samples for rare classes using probabilistic topic models. Experimental results on various real-world data sets show that both DECOM and DECODER achieve better classification performance on rare classes, and that DECODER is more robust in handling very noisy data. A rough pipeline sketch follows.
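Reading the two methods at this level of detail, a heavily simplified sketch of the DECOM balancing loop might look as follows; fit_topic_model and generate_doc stand for assumed helpers such as the sketches above, and the paper's actual algorithm (Section 2) may differ.

```python
def decom_balance(class_docs, fit_topic_model, generate_doc, target_size):
    """Rough sketch of the DECOM idea: for each rare class, fit a topic
    model on that class's documents, then generate synthetic documents
    until the class reaches target_size."""
    balanced = {}
    for label, docs in class_docs.items():
        docs = list(docs)
        if len(docs) < target_size:                  # rare class
            model = fit_topic_model(docs)            # class-level topic model
            docs += [generate_doc(model) for _ in range(target_size - len(docs))]
        balanced[label] = docs
    return balanced
```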

Overview: The remainder of this paper is organized as follows. Section 2 introduces probabilistic topic models as well as two re-sampling methods based on probabilistic topic models. In Section 3, we present experimental results on a number of real-world document data sets. Section 4 describes some related work. Finally, we draw conclusions in Section 5.

Section snippets

Re-sampling with probabilistic topic models

In this section, we first present a probabilistic topic model based on Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Then we propose two re-sampling methods: DECOM and DECODER.
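For readers who want to experiment, a topic model can be fitted and its distributions recovered with scikit-learn's variational LDA; this is a convenient stand-in only, whereas the paper builds on LDA estimated with Gibbs sampling (Griffiths & Steyvers, 2004). The toy corpus below is ours.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stocks fell sharply today", "the team won the match",
        "markets rallied on earnings", "players scored in the final"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)                 # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)                       # (n_docs, K) topic mixtures
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # (K, V) word dists
vocab = vectorizer.get_feature_names_out()         # maps column indices back to words
```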

Experimental evaluations

In this section, we evaluate the effectiveness of the proposed DECOM and DECODER re-sampling methods in two stages. In the first stage (Sections 3.2.1 and 3.2.2, which report results on two-class and multi-class data sets, respectively), we use several imbalanced real-world data sets to show the effectiveness of DECOM and DECODER in improving classification performance on rare classes. In the second stage, we evaluate the performance of DECODER on very noisy data.
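Under class imbalance, overall accuracy can mask poor performance on rare classes, so per-class metrics are the natural yardstick. As an illustrative snippet with toy labels (the paper's exact protocol is described in Section 3), per-class F1 scores and the macro average can be computed as follows:

```python
from sklearn.metrics import classification_report, f1_score

y_true = ["rare", "big", "big", "big", "rare", "big"]   # toy gold labels
y_pred = ["big",  "big", "big", "big", "rare", "big"]   # toy predictions

print(classification_report(y_true, y_pred))            # per-class precision/recall/F1
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))  # weights classes equally
```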

Related work

The class imbalance problem is a major research issue, and researchers have addressed this problem from various perspectives.

First, there is research focusing on understanding the class imbalance problem. For example, it has been shown (Japkowicz & Stephen, 2002; Wu et al., 2007) that the imbalance problem is related not only to the degree of class imbalance in the data sets, but also to the overall size of the training data as well as to the complexity of the concepts in the data.

Conclusion

In this paper, we have proposed a semantic re-sampling approach to handle the class imbalance problem in text categorization. Specifically, two re-sampling techniques, DECOM and DECODER, were developed based on probabilistic topic models. DECOM was proposed to deal with class imbalance by generating new samples for rare classes. For data sets with noisy samples and rare classes, DECODER was developed to smooth the data by regenerating all samples in each class using probabilistic topic models and then generating additional samples for rare classes.

Acknowledgements

The authors wish to thank the reviewers for their invaluable comments. The work described in this paper was supported by grants from the Natural Science Foundation of China (Grant No. 60775037), the Key Program of the National Natural Science Foundation of China (Grant No. 60933013), the National High Technology Research and Development Program of China (Grant No. 2009AA01Z123), and the Research Fund for the Doctoral Program of Higher Education of China (Grant No. 20093402110017).

References (29)

  • Y. Ko et al. (2009). Text classification from unlabeled documents with bootstrapping and feature projection techniques. Information Processing and Management.
  • F. Paradis et al. (2007). Contextual feature selection for text classification. Information Processing and Management.
  • S. Tan (2005). Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Systems with Applications.
  • G.E.A.P.A. Batista et al. (2004). A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations Newsletter.
  • D.M. Blei et al. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research.
  • Brank, J., & Grobelnik, M. (2003). Training text classifiers with SVM on very few positive examples. Tech. Rep. ...
  • G. Casella et al. (1992). Explaining the Gibbs sampler. The American Statistician.
  • Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: A library for support vector machines. ...
  • N.V. Chawla et al. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research.
  • N. Cristianini et al. (2000). An introduction to support vector machines and other kernel-based learning methods.
  • M. DeGroot et al. (2001). Probability and statistics.
  • T.L. Griffiths et al. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America.
  • Han, E.-H., Boley, D., Gini, M., Gross, R., Hastings, K., Karypis, G., et al. (1998). WebACE: A web agent for document ...
  • P.E. Hart (1968). The condensed nearest neighbor rule. IEEE Transactions on Information Theory.