Abstract
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with a naive independence assumption: the explanatory variables \(X_i\) are assumed to be conditionally independent of one another given the target variable \(Y\). Despite this strong assumption, this classifier has proved to be very effective in many real applications and is often used for supervised classification on data streams. The naive Bayes classifier simply relies on the estimation of the univariate conditional probabilities \(P(X_i \mid C)\), which can be computed on a data stream using a "supervised quantiles summary." The literature shows that the naive Bayes classifier can be improved (1) by selecting a subset of the explanatory variables and (2) by weighting them. Most of these methods are designed for batch (offline) learning: they need to store all the data in memory and/or to read each example more than once, and therefore cannot be used on data streams. This paper presents a new method, based on a graphical model, which computes the weights on the input variables using a stochastic estimation. The method is incremental and produces a weighted naive Bayes classifier for data streams. It is compared to the classical naive Bayes classifier on the Large Scale Learning challenge datasets.
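To make the weighting concrete, the sketch below (illustrative only, not the authors' implementation) shows one common form of weighted naive Bayes, in which each variable's log conditional probability is scaled by a weight \(w_i\) before the softmax normalization used in the appendix. The function name and all values are hypothetical; the log priors and log conditional probabilities are assumed given (in the paper they come from a supervised quantiles summary, and the weights are learned incrementally by stochastic gradient descent).

```python
import numpy as np

def weighted_naive_bayes_posterior(log_prior, log_cond, w):
    """Posterior P(C_k | X) for a weighted naive Bayes classifier.

    log_prior : (K,) array, log P(C_k)
    log_cond  : (K, d) array, log P(X_i | C_k) for the observed values X_i
    w         : (d,) array, one weight per explanatory variable
                (w = 1 everywhere recovers the standard naive Bayes classifier)
    """
    # H_k = log P(C_k) + sum_i w_i * log P(X_i | C_k)
    h = log_prior + log_cond @ w
    # Softmax normalization: P_k = exp(H_k) / sum_j exp(H_j)
    h -= h.max()                      # shift for numerical stability
    p = np.exp(h)
    return p / p.sum()

# Toy usage: 2 classes, 3 explanatory variables (arbitrary values)
log_prior = np.log([0.6, 0.4])
log_cond = np.log([[0.2, 0.5, 0.1],
                   [0.3, 0.1, 0.4]])
w = np.array([1.0, 0.5, 0.0])        # a weight of 0 discards a variable
print(weighted_naive_bayes_posterior(log_prior, log_cond, w))
```

A weight of 1 keeps a variable's full contribution, a weight of 0 removes it, so variable selection appears as a special case of weighting.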
Appendix: Derivative of the Cost Function
The graphical model is built so that its outputs directly provide the values of \(P(C_k \mid X)\). The goal is to maximize the likelihood, and therefore to minimize the negative log-likelihood. The first step of the calculation is to decompose the softmax: each output can be seen as the succession of two steps, an activation followed by a function of this activation.
Here the activation function is \(O_k = f(H_k) = \exp(H_k)\), and the output of the softmax part of the graphical model is \(P_k = \frac{O_k}{\sum_{j=1}^{K} O_j}\). The derivative of the activation function is \(\frac{\partial O_k}{\partial H_k} = \exp(H_k) = O_k\).
The cost function being the negative log-likelihood, two cases must be considered: (1) the desired output equals 1, and (2) the desired output equals 0. In the following we write, with \(T_k \in \{0, 1\}\) the desired output,
\[\mathrm{Cost} = -\sum_{j=1}^{K} T_j \log(P_j).\]
In the case where the desired output of output \(k\) equals 1, the cost reduces to \(\mathrm{Cost} = -\log(P_k) = -H_k + \log \sum_{j=1}^{K} \exp(H_j)\). Therefore
\[\frac{\partial \mathrm{Cost}}{\partial H_k} = -1 + \frac{\exp(H_k)}{\sum_{j=1}^{K} \exp(H_j)} = P_k - 1.\]
In the case where the desired output of output \(k\) equals 0, the error is transmitted only through the normalization term of the softmax, since the term \(-T_k \log(P_k)\) vanishes when \(T_k = 0\). Similar steps give \(\frac{\partial \mathrm{Cost}}{\partial H_k} = P_k\).
Finally we conclude that \(\frac{\partial \mathrm{Cost}}{\partial H_k} = P_k - T_k \ \forall k\), where \(T_k\) is the desired probability and \(P_k\) the estimated probability. The rest of the calculation of \(\frac{\partial \mathrm{Cost}}{\partial w_{ik}}\) then follows from the chain rule.
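As a quick sanity check of this result, the following sketch (illustrative only, with arbitrary values) compares the closed-form gradient \(P_k - T_k\) against a central finite-difference estimate of the negative log-likelihood:

```python
import numpy as np

def softmax(h):
    """P_k = exp(H_k) / sum_j exp(H_j), computed stably."""
    e = np.exp(h - h.max())
    return e / e.sum()

def cost(h, t):
    """Negative log-likelihood: -sum_k T_k log(P_k)."""
    return -np.sum(t * np.log(softmax(h)))

h = np.array([0.5, -1.2, 2.0])       # arbitrary activations H_k
t = np.array([0.0, 1.0, 0.0])        # one-hot desired outputs T_k

analytic = softmax(h) - t            # closed-form gradient P_k - T_k

eps = 1e-6                           # central finite-difference estimate
numeric = np.array([
    (cost(h + eps * np.eye(3)[k], t) - cost(h - eps * np.eye(3)[k], t))
    / (2 * eps)
    for k in range(3)
])

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

The two gradients agree for both cases at once: the component where \(T_k = 1\) gives \(P_k - 1\), and the components where \(T_k = 0\) give \(P_k\).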
© 2015 Springer-Verlag Berlin Heidelberg
Salperwyck, C., Lemaire, V., & Hue, C. (2015). Incremental weighted naive Bayes classifiers for data stream. In B. Lausen, S. Krolak-Schwerdt, & M. Böhmer (Eds.), Data Science, Learning by Latent Structures, and Knowledge Discovery. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44983-7_16
Print ISBN: 978-3-662-44982-0
Online ISBN: 978-3-662-44983-7
eBook Packages: Mathematics and Statistics (R0)