Live and learn from mistakes: A lightweight system for document classification

https://doi.org/10.1016/j.ipm.2012.02.001

Abstract

We present a Life-Long Learning from Mistakes (3LM) algorithm for document classification, which can be used in various scenarios such as spam filtering, blog classification, and web resource categorization. We extend the ideas of online clustering and batch-mode centroid-based classification to online learning with negative feedback. The 3LM is a competitive learning algorithm that avoids the over-smoothing characteristic of centroid-based classifiers by using a different class representative, which we call a clusterhead. The clusterheads competing for vector-space dominance are drawn toward misclassified documents, eventually bringing the model to a “balanced state” for a fixed distribution of documents. Subsequently, the clusterheads oscillate between the misclassified documents, heuristically minimizing the rate of misclassifications, an NP-complete problem. Further, the 3LM algorithm prevents over-fitting by “leashing” the clusterheads to their respective centroids. A clusterhead provably converges if its class can be separated by a hyper-plane from all other classes. Lifelong learning with a fixed learning rate allows 3LM to adapt to a possibly changing distribution of the data and to continually learn and unlearn document classes. We report on our experiments, which demonstrate high accuracy of document classification on the Reuters-21578, OHSUMED, and TREC07p-spam datasets. The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based, Naïve Bayes, C4.5, AdaBoost, kNN, and SVM classifiers whose accuracy had been reported on the same three corpora.

Highlights

  • Text processing.
  • Clusterheads leashed to class centroids.
  • Online learning with negative feedback.
  • A lightweight, well-performing system for online document classification is proposed.

Introduction

There is no shortage of machine learning tools for batch-mode data analysis. However, the need to analyze larger and larger datasets has created demand for a new class of online learning algorithms (online SVMs: Bordes & Bottou, 2005; Sculley & Wachman, 2007; Zhang et al., 2005; online decision trees: Basak, 2006; etc.), which can produce results similar to or better than their batch-mode counterparts, but are more efficient because they can learn from every new example. In real-life scenarios, training data may not be available all at once (e.g., streaming data), or the training set may be prohibitively large, which makes batch-mode global optimization algorithms inefficient or even impractical.

Document classification, web resource categorization, spam filtering, and blog classification are just some possible applications that could benefit from online learning. Most well-known document classification algorithms, such as the centroid-based classifier (Han & Karypis, 2000; Rocchio, 1971), are batch-mode algorithms and cannot be used effectively with streaming data because they need to be rerun periodically on the entire dataset. On the other hand, online algorithms, such as the Bayesian Online Classifier (Chai, Chieu, & Ng, 2002), require periodic retraining due to “forgetting”.
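
To make the batch-mode limitation concrete, here is a minimal sketch of a centroid-based classifier in the spirit of Rocchio (1971) and Han & Karypis (2000): each class is represented by the mean of its document vectors (e.g., tf-idf), and a new document is assigned to the class with the most cosine-similar centroid. The representation and function names below are illustrative assumptions rather than the exact setup evaluated in this paper; the point is that incorporating new labeled documents requires recomputing the centroids over the entire collection.

```python
import numpy as np

def train_centroids(X, y):
    """Batch training: X is an (n_docs x n_terms) matrix of document
    vectors (e.g., tf-idf), y holds the class labels. Each class
    centroid is the mean of that class's document vectors."""
    y = np.asarray(y)
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def classify(doc_vec, centroids):
    """Assign the document to the class whose centroid has the highest
    cosine similarity with the document vector."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(centroids, key=lambda label: cos(doc_vec, centroids[label]))

# The batch-mode limitation: when new labeled documents arrive, the
# centroids must be recomputed over the entire (grown) collection:
# centroids = train_centroids(np.vstack([X_old, X_new]),
#                             np.concatenate([y_old, y_new]))
```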

At the same time, a “truly” online document classifier should be able to categorize streaming data and adapt to a changing environment by continually learning and unlearning document classes. This type of learning is also called lifelong or never-ending learning (Beroule, 1988). For example, spam filtering and classification of news feeds with ever-changing features are but a few useful applications of lifelong learning (Rennie, 2000; Sahami et al., 1998).

In this paper, we draw on the existing machine learning techniques for document classification and clustering to extend the batch-mode centroid-based classifier to lifelong learning. We describe the limitations of the centroid-based classifier and propose a new Life-Long Learning from Mistakes (3LM) algorithm that:

  1. Learns to classify documents on a per-example basis and never stops learning, adapting to evolving data;
  2. Uses negative feedback to reinforce learning (learning with a critic);
  3. Fits the distribution of the data, trying to minimize the number of misclassifications, instead of estimating the centroids (class means);
  4. Provably converges for hyperplane-separable classes;
  5. Experimentally demonstrates better accuracy than centroid-based document classification algorithms;
  6. Demonstrates better performance than centroid-based, Naïve Bayes, C4.5, AdaBoost, kNN, and SVM algorithms;
  7. Can find applications in standard document classification tasks, as well as in lifelong learning scenarios where user feedback may be available, e.g., Facebook/Twitter feed classification.

The main contribution of the 3LM algorithm is in avoiding the over-smoothing present in centroid-based classifiers. At the same time, the 3LM algorithm does not overfit the dataset. It achieves this by using different class representatives, called clusterheads, which are trained on a per-example basis by an online procedure similar to perceptron training (Minsky & Papert, 1969). We note that the term “clusterhead” has been used earlier in the literature (Basagni et al., 2001; Kumar et al., 2009) to denote various concepts not directly related to our use of it.
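
As an illustration only, the sketch below shows the kind of per-example, perceptron-like update suggested by this description: when negative feedback reports a misclassification, the clusterhead of the correct class is pulled toward the offending document with a fixed learning rate, and a “leash” keeps it within a bounded distance of its class centroid. The exact update rule, the leash formulation, and the parameter names (learning_rate, leash_radius) are assumptions made for exposition; the precise 3LM algorithm is given in Section 3 of the paper.

```python
import numpy as np

def update_on_mistake(clusterheads, centroids, doc_vec, true_label,
                      learning_rate=0.1, leash_radius=1.0):
    """Hypothetical 3LM-style update on a misclassified document:
    move the true class's clusterhead toward the document, then
    project it back onto a ball of radius `leash_radius` around the
    class centroid so it cannot drift arbitrarily far (the 'leash')."""
    head = clusterheads[true_label]
    centroid = centroids[true_label]

    # Competitive, per-example step toward the misclassified document.
    head = head + learning_rate * (doc_vec - head)

    # Leash: clip the displacement from the class centroid.
    offset = head - centroid
    dist = np.linalg.norm(offset)
    if dist > leash_radius:
        head = centroid + offset * (leash_radius / dist)

    clusterheads[true_label] = head
    return clusterheads
```

A full learner would classify each incoming document against all clusterheads (e.g., by cosine similarity) and invoke an update of this kind only when feedback signals a mistake.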

The rest of the paper is organized as follows. Section 2 overviews related work and provides the background for subsequent sections. Section 3 describes our algorithm in detail. Experimental validation of our algorithm appears in Section 4. We discuss the impact and significance of our work, including possible applications, in Section 5 and conclude the paper in Section 6.

Section snippets

Related work on machine learning

Our work has broad connections with research in online learning, document classification, and reinforcement learning.

Online learning is a growing subset of machine learning algorithms that learn on a per-example basis. There exist online clustering algorithms (Zhang et al., 2005; Zhong, 2005), online SVMs (Bordes & Bottou, 2005; Sculley & Wachman, 2007; Zhang et al., 2005), online Bayesian learning (Chai et al., 2002; Opper, 1998; Solla & Winther, 1998), etc. The 3LM algorithm shares some

Lifelong learning for document classification

In this section, we continue exploring the idea of extending centroid-based classification to lifelong learning and discuss its limitations. We give the intuition behind our 3LM classifier and formally describe its algorithm. We close the section with a proof of convergence of 3LM for hyperplane-separable classes.
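
For context, “hyperplane-separable” can be read in the standard sense of linear separability, stated below; the paper's own convergence argument and any additional conditions appear in Section 3 and are not reproduced in this snippet.

$$\exists\, w,\, b \;:\quad w^{\top} x_i + b > 0 \;\; \forall\, x_i \in c, \qquad w^{\top} x_j + b < 0 \;\; \forall\, x_j \notin c,$$

i.e., some hyperplane $w^{\top} x + b = 0$ places all documents of class $c$ strictly on one side and all documents of the other classes strictly on the other.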

Empirical evaluation

We conducted a series of experiments to verify the accuracy of 3LM on commonly used classification datasets. In our experiments, we compared 3LM with the other two algorithms discussed in this paper: the batch-mode and online centroid-based classifiers. In this section, we give an overview of our experiments and show how the obtained results support the intuition and the theory presented in this paper.
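
For readers who want to set up a comparable online evaluation, a common protocol is prequential (test-then-train) evaluation: each streamed document is first classified, the prediction is scored, and only then is the labeled example used to update the learner. The harness below is a generic sketch under that assumption; the classifier interface (predict, learn) and the stream are placeholders, not the paper's actual experimental code.

```python
def prequential_accuracy(classifier, stream):
    """Test-then-train evaluation over a stream of (doc_vec, label)
    pairs: predict first, record correctness, then learn from the
    labeled example. Returns the running accuracy."""
    correct = 0
    total = 0
    for doc_vec, label in stream:
        prediction = classifier.predict(doc_vec)   # test step
        correct += int(prediction == label)
        total += 1
        classifier.learn(doc_vec, label)           # train step (feedback)
    return correct / max(total, 1)
```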

Applications

Lifelong learning in document classification from streaming data can have a number of applications. For example, spam filters need to keep learning the features of spam messages as spammers find new ways to evade them.

To be able to learn effectively, some classifiers require expert feedback, which is then used to reinforce the learning. This is also called “learning with a critic.” Such feedback is implicit if labeled data samples are used; however, in case of classification

Conclusion and future work

In this paper, we described 3LM – a new algorithm for document classification centered on the idea of lifelong learning from misclassifications. We provided experimental evidence of its effectiveness compared to centroid-based classifiers on the standard Reuters, OHSUMED, and TREC07p datasets.

There are several avenues for expanding our work in the future. From a theoretical point of view, formalizing and analyzing the balancing behavior of the learner is an interesting problem that can shed

Acknowledgment

We thank the anonymous reviewers for their many helpful comments which greatly improved the presentation. VP is funded by the Academy of Finland grant 138520.

References (34)

  • Arbib, M.A. (2008). The handbook of brain theory and neural networks.
  • Basagni, S., Herrin, K., Bruschi, D., & Rosti, E. (2001). Secure pebblenets. In Proceedings of the 2nd ACM international...
  • Basak, J. (2006). Online adaptive decision trees: Pattern classification and function approximation. Neural Computation.
  • Berikov, V., & Litvinenko, A. (2003). Methods for statistical data analysis with decision tree. Novosibirsk Sobolev...
  • Beroule, D. (1988). The never-ending learning. In Proceedings of the NATO advanced research workshop on neural...
  • Biehl, M., et al. (1997). Dynamics of on-line competitive learning. A Letters Journal Exploring the Frontiers of Physics.
  • Bloehdorn, S., & Hotho, A. (2004). Boosting for text classification with semantic features. In Proceedings of the MSW...
  • Bordes, A., & Bottou, L. (2005). The Huller: A simple and efficient online SVM. In Proceedings of ECML, 16th European...
  • Bottou, L., et al. (1995). Convergence properties of the K-means algorithms. Advances in Neural Information Processing Systems.
  • Chai, K.M.A., Chieu, H.L., & Ng, H.T. (2002). Bayesian online classifiers for text classification and filtering. In...
  • Dasarathy, B.V. (1991). Nearest neighbor (NN) norms: NN pattern classification techniques....
  • Dhillon, I.S., et al. (2001). Concept decompositions for large sparse text data using clustering. Machine Learning.
  • Dumais, S., Furnas, G.W., Landauer, T.K., Deerwester, S., & Harshman, R. (1988). Using latent semantic analysis to...
  • Garey, M.R., & Johnson, D.S. (1979). Computers and intractability: A guide to the theory of NP-completeness. New York,...
  • Godbole, S., Harpale, A., Sarawagi, S., & Chakrabarti, S. (2004). Document classification through interactive supervision...
  • Guan, H., Zhou, J., & Guo, M. (2009). A class-feature-centroid classifier for text categorization. In Proceedings of the...
  • Han, E.-H., & Karypis, G. (2000). Centroid-based document classification: Analysis and experimental results. Principles...