
Incremental Weighted Naive Bayes Classifiers for Data Stream

  • Conference paper

Abstract

A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem under a naive independence assumption: the explanatory variables (X i ) are assumed to be conditionally independent of each other given the target variable (Y ). Despite this strong assumption, the classifier has proved very effective in many real applications and is often used for supervised classification on data streams. The naive Bayes classifier relies only on the estimation of the univariate conditional probabilities P(X i  | Y). On a data stream, this estimation can be provided by a “supervised quantiles summary.” The literature shows that the naive Bayes classifier can be improved by (1) using a variable selection method or (2) weighting the explanatory variables. Most of these methods are designed for batch (off-line) learning: they need to store all the data in memory and/or require reading each example more than once, and therefore cannot be used on data streams. This paper presents a new method, based on a graphical model, which computes the weights of the input variables using a stochastic estimation. The method is incremental and produces a weighted naive Bayes classifier for data streams. It is compared to the classical naive Bayes classifier on the Large Scale Learning challenge datasets.
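
Since the classifier only needs the class prior and the univariate conditional probabilities, a minimal Python sketch of an incremental weighted naive Bayes for categorical features is given below. All names (IncrementalWeightedNB, learn_one, ...) are illustrative and not from the paper; in particular, the supervised quantiles summary used for numeric attributes and the graphical-model estimation of the weights is not reproduced here, and the weights are simply left at 1 (plain naive Bayes) unless set externally.

```python
# Minimal sketch of an incremental (weighted) naive Bayes classifier on a
# stream of categorical features. Illustrative only: the paper additionally
# handles numeric features with a supervised quantiles summary and learns
# the weights w_i online with a stochastic estimation.
import math
from collections import defaultdict


class IncrementalWeightedNB:
    def __init__(self, n_features, laplace=1.0):
        self.n_features = n_features
        self.laplace = laplace
        self.class_counts = defaultdict(int)                  # N(y)
        self.feature_counts = defaultdict(int)                # N(X_i = v, y)
        self.feature_values = [set() for _ in range(n_features)]
        self.weights = [1.0] * n_features                     # w_i (1.0 = plain NB)

    def learn_one(self, x, y):
        """Update the sufficient statistics with a single example (one pass)."""
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.feature_counts[(i, v, y)] += 1
            self.feature_values[i].add(v)

    def predict_proba(self, x):
        """Return P(y | x), proportional to P(y) * prod_i P(x_i | y)^{w_i}."""
        n = sum(self.class_counts.values())
        log_post = {}
        for y, n_y in self.class_counts.items():
            lp = math.log(n_y / n)
            for i, v in enumerate(x):
                num = self.feature_counts[(i, v, y)] + self.laplace
                den = n_y + self.laplace * max(len(self.feature_values[i]), 1)
                lp += self.weights[i] * math.log(num / den)
            log_post[y] = lp
        m = max(log_post.values())                            # log-sum-exp normalization
        z = sum(math.exp(lp - m) for lp in log_post.values())
        return {y: math.exp(lp - m) / z for y, lp in log_post.items()}


# Usage on a toy stream of (x, y) pairs:
model = IncrementalWeightedNB(n_features=2)
for x, y in [(("red", "small"), "A"), (("blue", "large"), "B"), (("red", "large"), "A")]:
    model.learn_one(x, y)
print(model.predict_proba(("red", "small")))   # P(A | x) should dominate
```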




Appendix: Derivative of the Cost Function


The graphical model is built so that the values of P(C k  | X) are obtained directly at the output. The goal is to maximize the likelihood, and therefore to minimize the negative log likelihood. The first step of the calculation is to decompose the softmax: each output can be seen as the succession of two steps, an activation followed by a function of this activation.

Here the activation function can be written as \(O_{k} = f(H_{k}) = \exp(H_{k})\), and the output of the softmax part of the graphical model is \(P_{k} = \frac{O_{k}}{\sum _{j=1}^{K}O_{j}}\). The derivative of the activation function is:

$$\displaystyle{ \frac{\partial O_{k}}{\partial H_{k}} = f'(H_{k}) = \exp(H_{k}) = O_{k} }$$
(5)
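
For concreteness, here is a small Python sketch of this forward pass (illustrative code, not from the paper; H is the vector of activations \(H_{k}\)):

```python
# Forward pass of the softmax output layer described above:
# H_k is the activation of output k, O_k = exp(H_k), P_k = O_k / sum_j O_j.
import math

def softmax_outputs(H):
    m = max(H)                                  # shift for numerical stability
    O = [math.exp(h - m) for h in H]            # O_k = exp(H_k), up to a common constant
    total = sum(O)
    return [o / total for o in O]               # P_k

# Example with three outputs:
H = [1.2, -0.3, 0.5]
P = softmax_outputs(H)
assert abs(sum(P) - 1.0) < 1e-12
```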

Since the cost function is the negative log likelihood, two cases must be considered: (1) the desired output is equal to 1, or (2) the desired output is equal to 0. In what follows we write:

$$\displaystyle{ \frac{\partial \mathrm{Cost}} {\partial H_{k}} = \frac{\partial C} {\partial P_{k}} \frac{\partial P_{k}} {\partial O_{k}} \frac{\partial O_{k}} {\partial H_{k}} }$$
(6)

In the case where the desired output for output k is equal to 1, substituting (5) into (6) gives:

$$\displaystyle{ \frac{\partial \mathrm{Cost}} {\partial H_{k}} = \frac{\partial C} {\partial P_{k}} \frac{\partial P_{k}} {\partial O_{k}} \frac{\partial O_{k}} {\partial H_{k}} = \frac{-1} {P_{k}} \frac{\partial P_{k}} {\partial O_{k}}O_{k} }$$
(7)
$$\displaystyle\begin{array}{rcl} \frac{\partial \mathrm{Cost}} {\partial H_{k}} & =& \frac{-1} {P_{k}}\left [\sum _{l=1,l\neq k}^{K}\left ( \frac{O_{l}} {\big(\sum _{j=1}^{K}O_{j}\big)^{2}}\right )\right ]O_{k} \\ & =& \frac{-1} {P_{k}}\left [\frac{\big(\sum _{j=1}^{K}O_{j}\big) - O_{k}} {\big(\sum _{j=1}^{K}O_{j}\big)^{2}} \right ]O_{k} {}\end{array}$$
(8)
$$\displaystyle\begin{array}{rcl} \frac{\partial \mathrm{Cost}} {\partial H_{k}} & =& \frac{-1} {P_{k}}\left [\frac{\big(\sum _{j=1}^{K}O_{j}\big) - O_{k}} {\big(\sum _{j=1}^{K}O_{j}\big)} \right ] \frac{O_{k}} {\big(\sum _{j=1}^{K}O_{j}\big)} \\ & =& \frac{-1} {P_{k}}\left [1 - \frac{O_{k}} {\big(\sum _{j=1}^{K}O_{j}\big)}\right ] \frac{O_{k}} {\big(\sum _{j=1}^{K}O_{j}\big)} {}\end{array}$$
(9)

Therefore

$$\displaystyle{ \frac{\partial \mathrm{Cost}} {\partial H_{k}} = \frac{-1} {P_{k}}[1 - P_{k}]P_{k} = P_{k} - 1 }$$
(10)

In the case where the desired output for output k is equal to 0, the cost term of that output is zero, so the error is transmitted only through the normalization part of the softmax function. Following similar steps we obtain: \(\frac{\partial \mathrm{Cost}} {\partial H_{k}} = P_{k}\)

Finally, we conclude that \(\frac{\partial \mathrm{Cost}} {\partial H_{k}} = P_{k} - T_{k},\forall k\), where T k is the desired probability and P k the estimated probability. The rest of the calculation of \(\frac{\partial \mathrm{Cost}} {\partial w_{\mathit{ik}}}\) is then straightforward.
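
As a sanity check of this result, the following Python sketch (illustrative, not from the paper) compares the analytic gradient \(P_{k} - T_{k}\) with a central finite-difference approximation of the negative log likelihood:

```python
# Numerical check of dCost/dH_k = P_k - T_k, where Cost is the negative
# log likelihood of the target class under the softmax outputs.
import math

def softmax(H):
    m = max(H)
    e = [math.exp(h - m) for h in H]
    s = sum(e)
    return [x / s for x in e]

def cost(H, target):
    return -math.log(softmax(H)[target])

H = [0.7, -1.1, 0.4]
target = 2                                    # index of the class with T_k = 1
P = softmax(H)
eps = 1e-6
for k in range(len(H)):
    H_plus = list(H);  H_plus[k] += eps
    H_minus = list(H); H_minus[k] -= eps
    numeric = (cost(H_plus, target) - cost(H_minus, target)) / (2 * eps)
    analytic = P[k] - (1.0 if k == target else 0.0)     # P_k - T_k
    assert abs(numeric - analytic) < 1e-6
```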


Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Salperwyck, C., Lemaire, V., Hue, C. (2015). Incremental Weighted Naive Bayes Classifiers for Data Stream. In: Lausen, B., Krolak-Schwerdt, S., Böhmer, M. (eds) Data Science, Learning by Latent Structures, and Knowledge Discovery. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44983-7_16

