Live and learn from mistakes: A lightweight system for document classification
Highlights
- Text processing.
- Clusterheads leashed to class centroids.
- Online learning with negative feedback.
- A lightweight, well-performing system for online document classification is proposed.
Introduction
There is no shortage of machine learning tools for batch-mode data analysis. However, the need to analyze ever larger datasets has created demand for a new class of online learning algorithms (online SVMs Bordes and Bottou, 2005, Sculley and Wachman, 2007, Zhang et al., 2005, online decision trees Basak, 2006, etc.), which can produce results comparable to or better than those of their batch-mode counterparts, but are more efficient because they learn from each new example. In real-life scenarios, training data may not be available all at once (e.g., streaming data), or the training set may be prohibitively large, which makes batch-mode global optimization algorithms inefficient or even impractical.
Document classification, web resource categorization, spam filtering, and blog classification are just some of the applications that could benefit from online learning. Most well-known document classification algorithms, such as the centroid-based classifier (Han et al., 2000, Rocchio, 1971), are batch-mode algorithms and cannot be used effectively with streaming data because they must be rerun periodically on the entire dataset. On the other hand, online algorithms, such as the Bayesian Online Classifier (Chai, Chieu, & Ng, 2002), require periodic retraining due to “forgetting”.
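As background, the batch-mode centroid-based classifier mentioned above can be sketched in a few lines. This is a minimal illustration only, not the implementation evaluated in this paper; the function names and the choice of cosine similarity are ours:

```python
import numpy as np

def train_centroids(X, y):
    """Batch centroid-based training: one mean vector per class.

    X is a (documents x features) matrix of document vectors,
    y the array of class labels. Note the batch-mode limitation:
    any new document requires recomputing the class means.
    """
    classes = np.unique(y)
    return {c: X[y == c].mean(axis=0) for c in classes}

def classify(x, centroids):
    """Assign a document vector to the class with the most similar centroid."""
    def cos(a, b):
        # Small epsilon guards against zero-length vectors.
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(centroids, key=lambda c: cos(x, centroids[c]))
```

Averaging all documents of a class into a single centroid is what produces the over-smoothing discussed below: the class representative is pulled toward the bulk of the data regardless of where the decision boundary actually lies.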
At the same time, a “truly” online document classifier should be able to categorize streaming data and adapt to the changing environment by continually learning and unlearning document classes. This type of learning is also called lifelong or never-ending (Beroule, 1988). For example, spam filtering and classification of news feeds with ever changing features are but a few useful applications of lifelong learning (Rennie, 2000, Sahami et al., 1998).
In this paper, we draw on existing machine learning techniques for document classification and clustering to extend the batch-mode centroid-based classifier to lifelong learning. We describe the limitations of the centroid-based classifier and propose a new Life-Long Learning from Mistakes (3LM) algorithm that:
- 1. Learns to classify documents on a per-example basis and never stops learning, adapting to evolving data;
- 2. Uses negative feedback to reinforce learning (learning with a critic);
- 3. Fits the distribution of the data, trying to minimize the number of misclassifications, instead of estimating the centroids (class means);
- 4. Provably converges for hyperplane-separable classes;
- 5. Experimentally demonstrates better accuracy than centroid-based document classification algorithms;
- 6. Demonstrates better performance than centroid-based, Naïve Bayes, C4.5, AdaBoost, kNN, and SVM algorithms;
- 7. Can find applications in standard document classification tasks, as well as in lifelong-learning scenarios where user feedback may be available, e.g., Facebook/Twitter feed classification.
The main contribution of the 3LM algorithm is in avoiding the over-smoothing present in centroid-based classifiers. At the same time, the 3LM algorithm does not overfit the dataset. It achieves this by using different class representatives – clusterheads – which are trained on a per-example basis by an online procedure similar to perceptron training (Minsky & Papert, 1969). We note that the term “clusterhead” has been used earlier in the literature (Basagni et al., 2001, Kumar et al., 2009) to denote various concepts not directly related to our use of it.
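The per-example, perceptron-like training of clusterheads can be pictured roughly as follows. This is an illustrative sketch only: the learning rate, the exact push/pull rule, and the leashing of clusterheads to class centroids are our simplifying assumptions, not the paper's precise update:

```python
import numpy as np

def online_update(clusterheads, x, true_label, eta=0.1):
    """One 3LM-style learning step (illustrative sketch).

    On a misclassification, pull the correct class's clusterhead toward
    the example and push the wrongly winning clusterhead away from it
    (negative feedback, i.e., learning with a critic). Correctly
    classified examples trigger no update, as in perceptron training.
    """
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    predicted = max(clusterheads, key=lambda c: cos(x, clusterheads[c]))
    if predicted != true_label:
        # Reinforce the correct class ...
        clusterheads[true_label] += eta * (x - clusterheads[true_label])
        # ... and penalize the mistaken winner.
        clusterheads[predicted] -= eta * (x - clusterheads[predicted])
    return predicted
```

Because updates fire only on mistakes, the clusterheads track the decision boundary rather than the class means, which is how this style of training avoids the over-smoothing of a pure centroid representative.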
The rest of the paper is organized as follows. Section 2 overviews related work and provides the background for subsequent sections. Section 3 describes our algorithm in detail. Experimental validation of our algorithm appears in Section 4. We discuss the impact and significance of our work, including possible applications, in Section 5, and conclude the paper in Section 6.
Related work on machine learning
Our work has broad connections with research in online learning, document classification, and reinforcement learning.
Online learning is a growing family of machine learning algorithms that learn on a per-example basis. There exist online clustering algorithms (Zhang et al., 2005, Zhong, 2005), online SVMs (Bordes and Bottou, 2005, Sculley and Wachman, 2007, Zhang et al., 2005), online Bayesian learning (Chai et al., 2002, Opper, 1998, Solla and Winther, 1998), etc. The 3LM algorithm shares some
Lifelong learning for document classification
In this section, we continue exploring the idea of lifelong learning for centroid-based classification and its limitations. We give the intuition for our 3LM classifier and formally describe its algorithm. We close the section with a proof of convergence of 3LM for hyperplane-separable classes.
Empirical evaluation
We conducted a series of experiments to verify the accuracy of 3LM on commonly used classification datasets. In our experiments, we compared 3LM with the two other algorithms discussed in this paper: the centroid-based batch and centroid-based online classifiers. In this section, we overview our experiments and show how the obtained results support the intuition and the theory presented in this paper.
Applications
Lifelong learning in document classification from streaming data can have a number of applications. For example, spam filters need to keep learning the features of spam messages as spammers find new ways to evade them.
To be able to learn effectively, some classifiers require expert feedback, which is then used to reinforce the learning. This is also called “learning with a critic.” Such feedback is implicit if labeled data samples are used; however, in case of classification
Conclusion and future work
In this paper, we described 3LM – a new algorithm for document classification centered on the idea of lifelong learning from misclassifications. We provided experimental evidence of its effectiveness compared to centroid-based classifiers on the standard Reuters, OHSUMED, and TREC07p datasets.
There are several avenues for expanding our work in the future. From a theoretical point of view, formalizing and analyzing the balancing behavior of the learner is an interesting problem that can shed
Acknowledgment
We thank the anonymous reviewers for their many helpful comments which greatly improved the presentation. VP is funded by the Academy of Finland grant 138520.
References
- The handbook of brain theory and neural networks (2008).
- Basagni, S., Herrin, K., Bruschi, D., & Rosti, E. (2001). Secure pebblenets. In Proceedings of the 2nd ACM international...
- Basak, J. (2006). Online adaptive decision trees: Pattern classification and function approximation. Neural Computation.
- Berikov, V., & Litvinenko, A. (2003). Methods for statistical data analysis with decision tree. Novosibirsk Sobolev...
- Beroule, D. (1988). The never-ending learning. In Proceedings of the NATO advanced research workshop on neural...
- et al. (1997). Dynamics of on-line competitive learning. A Letters Journal Exploring the Frontiers of Physics.
- Bloehdorn, S., & Hotho, A. (2004). Boosting for text classification with semantic features. In Proceedings of the MSW...
- Bordes, A., & Bottou, L. (2005). The Huller: A simple and efficient online SVM. In Proceedings of ECML, 16th European...
- et al. (1995). Convergence properties of the K-means algorithms. Advances in Neural Information Processing Systems.
- Chai, K.M.A., Chieu, H.L., & Ng, H.T. (2002). Bayesian online classifiers for text classification and filtering. In...
- Concept decompositions for large sparse text data using clustering. Machine Learning.