Probability estimation for multi-class classification using AdaBoost
Introduction
Most binary classification algorithms assume equal unconditional probabilities of the two classes, P(y = +1) = P(y = −1) = 1/2. With this basic setup, a more direct approach is to fit a function that assigns a value to each possible category; the prediction is determined by these values, which also reflect the reliability of each prediction. However, in some applications it is not enough to classify an object: what really counts is an accurate estimate of the posterior probability. There are techniques that attempt to produce probabilistic outputs from a classification function. In reverse, we intend to construct a classifier from a probability estimation method. This idea avoids the drawback that probabilities derived from a classifier may not be credible, because the classifier function concentrates on classification rather than on probability estimation.
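One common instance of such a technique, sketched here purely for illustration, is Platt-style sigmoid calibration: a logistic link p(y = 1 | s) = 1/(1 + exp(A·s + B)) is fitted to held-out classifier scores by gradient descent on the log-loss. The toy scores, labels, and learning-rate settings below are assumptions for the sketch, not values from the paper.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))

def platt_fit(scores, labels, lr=0.1, steps=2000):
    """Fit p(y=1 | s) = sigmoid(-(A*s + B)) by gradient descent on log-loss.

    labels are in {0, 1}; A starts negative so probability increases with s.
    """
    A, B = -1.0, 0.0
    for _ in range(steps):
        gA = gB = 0.0
        for s, t in zip(scores, labels):
            p = sigmoid(-(A * s + B))
            gA += (p - t) * (-s)   # d(log-loss)/dA for this example
            gB += (p - t) * (-1.0)
        A -= lr * gA / len(scores)
        B -= lr * gB / len(scores)
    return A, B

# Hypothetical held-out classifier scores and their true labels.
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 0, 1, 1, 1]
A, B = platt_fit(scores, labels)
prob = lambda s: sigmoid(-(A * s + B))
print(prob(-2.0), prob(2.0))
```

Positive scores are mapped to calibrated probabilities above one half, negative scores below; the fitted (A, B) control the slope and offset of the mapping.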
In this paper a probability model is proposed to estimate the posterior probabilities, which can then be used for classification via discriminant rules such as Bayes' theorem. The boosting framework is chosen for this purpose, for two reasons: first, we find that a simple probability estimation approach with a Newton-like method exactly reproduces the AdaBoost algorithm (see the proof in Section 2.1); second, AdaBoost is robust and efficient for classification. The first boosting algorithms were given by Schapire and Freund [1], [2], and AdaBoost was the first practical boosting algorithm [3]. Friedman et al. pointed out that AdaBoost can be interpreted as a stagewise estimation procedure for fitting an additive logistic regression model [4]; a consequence is that, from the statistical view, AdaBoost produces probability estimates. However, Mease et al. gave empirical evidence that the final probability estimates are often close to 1 or 0, even while the classification rule from AdaBoost shows no signs of overfitting and performs quite well [5]. Niculescu-Mizil and Caruana also noted this phenomenon and attempted to address it with calibration techniques [6]. Mease and Wyner argued that the phenomenon implies complete overfitting in terms of posterior probability (conditional class probability function) estimation [7]. Our probability model gives a theoretical analysis of the probability estimates produced by AdaBoost algorithms.
On the basis of Bayesian theory, the prediction of a binary classifier depends on the ratio of the posterior probabilities. Hence, we believe that there is a close connection between the posterior probabilities and the best performance of AdaBoost classifiers. This connection was given explicitly by Friedman et al. [4]; however, it is merely a byproduct of the optimization model min_F E[exp(−yF(x))], where F(x) is the strong classifier or regression function. In this paper we take posterior probability estimation as the starting point from which to deduce AdaBoost algorithms. Based on this, we expound: (1) the relationship between sampling and overfitting; (2) the relationship between the AdaBoost classifier and the Bayes error; (3) how to handle imbalanced data; (4) how to build a classifier for a multi-class problem; (5) how to obtain a robust prediction from the posterior probabilities.
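For concreteness, the link of Friedman et al. [4] between the AdaBoost score F(x) and the posterior, P(y = 1 | x) = 1/(1 + exp(−2F(x))), can be sketched with discrete AdaBoost over decision stumps. The tiny one-dimensional dataset and number of rounds below are illustrative assumptions; note how the estimated probabilities saturate toward 0 or 1, echoing the observation of Mease et al. [5].

```python
import math

# Illustrative 1-D training set: inputs with labels in {-1, +1}.
X = [0.1, 0.4, 0.35, 0.8, 0.9, 0.6]
y = [-1, -1, -1, 1, 1, 1]

def stump(theta, s):
    """Decision stump: predict s if x > theta, else -s, with s in {-1, +1}."""
    return lambda x: s if x > theta else -s

def best_stump(w):
    """Return the stump with the smallest weighted training error."""
    best, best_err = None, float("inf")
    for theta in X:
        for s in (-1, 1):
            h = stump(theta, s)
            err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
            if err < best_err:
                best, best_err = h, err
    return best, best_err

def adaboost(T=20):
    w = [1.0 / len(X)] * len(X)
    ensemble = []                                   # list of (alpha, weak learner)
    for _ in range(T):
        h, err = best_stump(w)
        err = min(max(err, 1e-12), 1 - 1e-12)       # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # weight of this weak learner
        ensemble.append((alpha, h))
        # Re-weight examples: misclassified points gain weight.
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def score(ensemble, x):
    return sum(a * h(x) for a, h in ensemble)       # the strong score F(x)

def posterior(ensemble, x):
    # Friedman et al.'s link: P(y=1 | x) = 1 / (1 + exp(-2 F(x))), computed stably.
    z = 2.0 * score(ensemble, x)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

ens = adaboost()
print(posterior(ens, 0.2), posterior(ens, 0.85))    # probabilities pinned near 0 and 1
```

On this separable toy problem the weak-learner weights grow large, so the induced probabilities collapse to the extremes even though the classification rule itself is fine — exactly the calibration issue discussed above.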
A detailed description of the probability model is given in the following section. In Section 3, we derive probability estimates for multi-class classification by pairwise coupling and obtain the analytic solution. In Section 4, we ask whether there exists a prediction framework more robust than the maximum-probability criterion. In fact, we find a different prediction model, called the Phase-out Model, which eliminates one class at a time until a single class survives; the surviving class is the prediction. A local-optimum algorithm is used to implement the model. We design experiments to compare against classical algorithms in Section 5 and end with concluding remarks in Section 6.
Section snippets
Probability model
First, we deduce the AdaBoost algorithm anew, regarding it as a posterior probability model approximating the Bayes minimum-error decision rule. Although a number of theories have touched on this, most rely on the optimization model and use an inverse method to reach the conclusion; our model is more direct. We would like to deal with classification tasks via the Bayesian decision rule, so a more natural way is to use the logistic model, and this brings out AdaBoost…
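For reference, the statistical-view connection established by Friedman et al. [4] can be written out in standard notation (not the section's own symbols):

```latex
% Population minimizer of the exponential criterion (Friedman et al. [4]):
J(F) = \mathbb{E}\left[e^{-yF(x)}\right], \qquad
F^{*}(x) = \arg\min_{F} J(F)
        = \tfrac{1}{2}\,\log\frac{P(y=1\mid x)}{P(y=-1\mid x)},
% and inverting this link yields the posterior estimate
\qquad
P(y=1\mid x) = \frac{e^{F^{*}(x)}}{e^{F^{*}(x)} + e^{-F^{*}(x)}}
             = \frac{1}{1 + e^{-2F^{*}(x)}}.
```

The inverse link in the last identity is what turns a fitted strong score F(x) into a posterior probability estimate.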
Probability estimates for multi-class classification
We now turn to multi-class boosting. There are two main approaches: one transforms the multi-class problem into a set of binary classification problems, such as AdaBoost.MH [11], MC [12], and ECC [13]; the other combines many multi-class weak learners, such as M1 [14] and SAMME [15]. The two approaches can be formulated in a uniform prediction framework, for all of them rely on the criterion that the prediction is determined solely by the highest score among all classes…
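As a sketch of the pairwise-coupling idea, the snippet below uses the classical closed form p_i ∝ 1/(Σ_{j≠i} 1/r_ij − (K − 2)), which is exact when the pairwise estimates are mutually consistent; it is an illustrative stand-in, not the analytic solution derived in Section 3.

```python
def couple(r):
    """Recover a K-class distribution from pairwise estimates.

    r[i][j] holds r_ij = p(class i | class i or j, x) for i != j.
    Since 1/r_ij = (p_i + p_j)/p_i, summing over j != i gives
    (K - 2) + 1/p_i, which yields the closed form used below.
    """
    K = len(r)
    p = []
    for i in range(K):
        s = sum(1.0 / r[i][j] for j in range(K) if j != i)
        p.append(1.0 / (s - (K - 2)))
    z = sum(p)                       # normalize to a proper distribution
    return [pi / z for pi in p]

# Consistent pairwise estimates generated from a known toy distribution.
true_p = [0.5, 0.3, 0.2]
K = len(true_p)
r = [[true_p[i] / (true_p[i] + true_p[j]) if i != j else None
      for j in range(K)] for i in range(K)]

print(couple(r))                     # recovers the underlying distribution
```

When the r_ij come from independently trained one-vs-one classifiers they are generally not consistent, which is why an optimization-based coupling (as in Section 3) is needed in practice.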
Prediction problem
In this section, we consider how to obtain a robust prediction. The traditional framework chooses the class with the maximum probability as the best one, once and for all; this method is based on Bayesian decision theory. Consider a practical problem: there are k teams in a basketball tournament and we want to predict the champion once and for all. This is the Bayesian decision idea. We now turn to a different route: in general it is a difficult task to give a credible prediction of…
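The sequential-elimination idea behind the Phase-out Model can be illustrated with a small sketch. Here each surviving class is scored by its average pairwise probability against the other survivors, and the weakest survivor is dropped each round; this particular scoring rule and the input matrix r are our illustrative assumptions, not the paper's local-optimum algorithm.

```python
def phase_out(r):
    """Eliminate the weakest surviving class each round until one remains.

    r[i][j] is a pairwise estimate p(class i | class i or j, x).
    Each round, score every survivor by its mean pairwise probability
    against the other survivors, then drop the lowest-scoring class.
    """
    alive = list(range(len(r)))
    while len(alive) > 1:
        scores = {i: sum(r[i][j] for j in alive if j != i) / (len(alive) - 1)
                  for i in alive}
        alive.remove(min(alive, key=lambda i: scores[i]))
    return alive[0]                  # the surviving class is the prediction

# Hypothetical pairwise estimates built from a known toy distribution.
true_p = [0.5, 0.3, 0.2]
K = len(true_p)
r = [[true_p[i] / (true_p[i] + true_p[j]) if i != j else None
      for j in range(K)] for i in range(K)]

print(phase_out(r))
```

Because the survivor scores are recomputed after every elimination, the decision can depend on the whole sequence of head-to-head comparisons rather than on a single ranking, which is the intuition the section develops.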
Experiments
In this section, we experiment on several UCI datasets to evaluate the new algorithms. In all cases, decision stumps are used as weak learners. The first experiment is designed for calibrating binary classifiers; the second and third experiments are conducted on multi-class datasets. We choose DAB and LogitBoost as the baseline strong classifiers, since they are the representative algorithms of the two objective functions (7), (13).
Conclusions
We have proposed a probability method for solving classification problems, which can be used to estimate the posterior probabilities for two or more classes. A series of experiments shows that it helps overcome the problems we face. Care is needed when applying AdaBoost algorithms: one must consider whether the algorithm needs calibration, whether the trained model has overfit, and so on.
Regarding the prediction model, we focus on the sequence of eliminations. The traditional model…
Conflict of interest
None declared.
Acknowledgments
This work was partially supported by the Aeronautical Science Foundation of China under Grant No. 20115169016, the General Armament Department Pre-research Foundation of China under Grant No. 9140C460302130C46173, and the Natural Science Foundation of Jiangsu Province of China under Grant No. BK20131296.
References (28)
Boosting a weak learning algorithm by majority, Inf. Comput. (1995)
A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. (1997)
Logistic regression using covariates obtained by product-unit neural network models, Pattern Recognit. (2007)
Pattern classification of dermoscopy images: a perceptually uniform model, Pattern Recognit. (2013)
An overview of ensemble methods for binary classifiers in multi-class problems: experimental study on one-vs-one and one-vs-all schemes, Pattern Recognit. (2011)
Additive estimators for probabilities of correct classification, Pattern Recognit. (1978)
The strength of weak learnability, Mach. Learn. (1990)
Additive logistic regression: a statistical view of boosting, Ann. Stat. (2000)
Boosted classification trees and class probability/quantile estimation, J. Mach. Learn. Res. (2006)
A. Niculescu-Mizil, R. Caruana, Obtaining calibrated probabilities from boosting, in: 21st Conference on Uncertainty in...
Evidence contrary to the statistical view of boosting, J. Mach. Learn. Res.
Generalized additive models, Stat. Sci.
Logistic model trees, Mach. Learn.
Improved boosting algorithms using confidence-rated prediction, Mach. Learn.
Qingfeng Nie received his B.S. degree in mathematics from Southeast University, Nanjing, China, in 2010. He is currently a Ph.D. Candidate in the School of Automation at the Southeast University. His research interests are in the areas of machine learning and data mining with emphasis on mathematical modeling.
Lizuo Jin received his Ph.D. degree in Pattern Recognition and Intelligent System from Southeast University, China, in 2000. He is now an associate professor at School of Automation, Southeast University. His research interests include theory and methods for machine learning, pattern recognition, computer vision and embedded systems.
Shumin Fei received his Ph.D. degree from Beihang University, Beijing, China, in 1995. He is now a professor at the School of Automation, Southeast University. His research interests are in the areas of nonlinear control system design and synthesis, hybrid system analysis, neural network control, and so on.