
Pattern Recognition

Volume 47, Issue 12, December 2014, Pages 3931-3940

Probability estimation for multi-class classification using AdaBoost

https://doi.org/10.1016/j.patcog.2014.06.008

Highlights

  • A probability estimation model is presented for classification.

  • The model reveals a new view of boosting.

  • The model provides a new channel for constructing classifier structure.

  • An analytical solution is given for pairwise coupling.

  • A robust prediction framework is proposed for pairwise coupling.

Abstract

It is a common viewpoint that the AdaBoost classifier performs excellently on classification problems but cannot produce good probability estimates. In this paper we put forward a theoretical analysis based on a probability estimation model and present some verification experiments, which indicate that AdaBoost can be used for probability estimation. With this theory, we suggest some useful measures for using AdaBoost algorithms properly. We then deduce a probability estimation model for multi-class classification by pairwise coupling. Unlike previous approximate methods, we provide an analytical solution instead of a special iterative procedure. Moreover, we raise a new problem: how to obtain a robust prediction from classifier scores. Experiments show that the traditional prediction framework, which chooses the class with the highest score as the prediction, is not always good, while our model performs well.

Introduction

Most binary classifier algorithms assume equal unconditional probabilities of the two classes {+1, −1}: P(Y=+1) = P(Y=−1) = 1/2. With this basic setup, a direct approach is to fit a linear function that assigns a value to each possible category. The prediction is determined by these values, which also reflect how reliable each prediction is. However, in some applications it is not enough to classify an object; what really counts is an accurate estimate of the posterior probability. There are techniques that attempt to produce probabilistic outputs from a classification function. Conversely, we intend to construct the classifier from a probability estimation method. This avoids the drawback that probabilities derived from a classifier may not be credible, because the classifier function concentrates on classification rather than on probability estimation.

In this paper a probability model is proposed to estimate the posterior probabilities, which can then be used for classification through a discriminant rule such as Bayes' theorem. The boosting algorithm is chosen to achieve this purpose, for two reasons: first, we find that a simple probability estimation approach combined with a Newton-like method exactly reproduces the AdaBoost algorithm (see the proof in Section 2.1); second, AdaBoost is robust and efficient for classification. The first boosting algorithms were given by Schapire and Freund [1], [2], and AdaBoost is the first practical boosting algorithm [3]. Friedman et al. pointed out that the AdaBoost algorithms can be interpreted as stagewise estimation procedures for fitting an additive logistic regression model [4]. A consequence is that, from the statistical view, AdaBoost produces probability estimates. However, Mease et al. gave empirical evidence that almost all of the final probability estimates are often close to 1 or 0, even while the classification rule from AdaBoost shows no sign of overfitting and performs quite well [5]. Niculescu-Mizil and Caruana also noted this phenomenon and tried to address it with calibration techniques [6]. Mease and Wyner considered that the phenomenon implies complete overfitting in terms of posterior probability (conditional class probability function) estimation [7]. Our probability model gives a theoretical analysis of the probability estimates produced by AdaBoost algorithms.
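For readers who want the link in [4] spelled out, the following restates Friedman et al.'s result (it is not an equation shown in this snippet): an additive score F(x) fitted by AdaBoost corresponds to the posterior estimate

  P(y = +1 | x) = e^{F(x)} / (e^{F(x)} + e^{−F(x)}) = 1 / (1 + e^{−2F(x)}).

Because boosting keeps increasing |F(x)| on the training sample as rounds accumulate, this mapping saturates, so the raw estimates pile up near 0 and 1. This is exactly the behaviour reported in [5], even when the classification rule sign(F(x)) remains accurate.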

On the basis of Bayesian theory, the prediction in binary classification depends on the ratio of the posterior probabilities. Hence, we believe there is a close connection between the posterior probabilities and the good performance of AdaBoost classifiers. This connection was given explicitly by Friedman et al. [4]. However, there it is just a byproduct of the optimization model E(e^{−yF(x)} | x), where F(x) is the strong classifier or regression function. In this paper we take posterior probability estimation as the starting point from which to deduce AdaBoost algorithms. Based on this, we expound: (1) the relationship between sampling and overfitting; (2) the relationship between the AdaBoost classifier and the Bayes error; (3) how to handle imbalanced data; (4) how to build a classifier for a multi-class problem; (5) how to obtain a robust prediction from the posterior probabilities.
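The byproduct mentioned above can be recovered in one line (a standard derivation from [4], reproduced here only as a reminder). Conditional on x,

  E(e^{−yF(x)} | x) = P(y=+1|x) e^{−F(x)} + P(y=−1|x) e^{F(x)},

and setting the derivative with respect to F(x) to zero gives

  F*(x) = (1/2) log [ P(y=+1|x) / P(y=−1|x) ].

Hence sign(F*(x)) is positive exactly when the posterior ratio exceeds 1, i.e. it coincides with the Bayes rule; this also shows why good classification accuracy by itself says little about the quality of the individual probability estimates.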

A detailed description of the probability model is given in the following section. In Section 3, we deduce probability estimates for multi-class classification by pairwise coupling and obtain the analytical solution. In Section 4, we raise the question of whether there exists a prediction framework more robust than the maximum probability criterion. In fact, we find a different prediction model, called the Phase-out Model, which weeds out classes one at a time until only one survives, and the surviving class is the prediction. A local optimum algorithm is used to implement the model. We design experiments to compare with classical algorithms in Section 5 and end with some concluding remarks in Section 6.
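To make the phase-out idea concrete, a minimal Python sketch follows. It is only an illustration: the function name, the use of a pairwise score matrix, and the rule of dropping the class with the lowest aggregate score against the surviving classes are assumptions made for exposition, not the elimination criterion or the local optimum algorithm actually derived in Section 4.

```python
import numpy as np

def phase_out_predict(score, classes=None):
    """Illustrative phase-out style prediction (not the paper's exact algorithm).

    score[i, j] is assumed to be a pairwise score of class i against class j,
    e.g. an estimate of P(y = i | y in {i, j}, x); the diagonal is ignored.
    At each step the class with the weakest aggregate score against the
    classes still in play is weeded out; the last survivor is returned.
    """
    score = np.asarray(score, dtype=float)
    alive = list(range(score.shape[0]))
    while len(alive) > 1:
        # aggregate each surviving class's scores against the other survivors
        totals = [sum(score[i, j] for j in alive if j != i) for i in alive]
        alive.pop(int(np.argmin(totals)))   # eliminate the weakest class
    return alive[0] if classes is None else classes[alive[0]]

# toy usage with three classes and pairwise probability estimates
r = np.array([[0.0, 0.6, 0.4],
              [0.4, 0.0, 0.7],
              [0.6, 0.3, 0.0]])
print(phase_out_predict(r, classes=["A", "B", "C"]))   # -> "A"
```

The point of the sequential form is that each elimination only requires comparing the surviving classes with one another, rather than committing to the single highest score in one step.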

Section snippets

Probability model

First, we re-derive the AdaBoost algorithm. Unlike earlier treatments, we regard it as a posterior probability model approaching the Bayes minimum-error decision rule. Although a number of theories have touched on this, most of them rely on the optimization model E(e^{−yF(x)} | x) and reach the conclusion by an inverse argument. Our model is more direct. We would like to handle classification tasks via the Bayesian decision rule, so a natural way is to use a logistic model, and this brings out AdaBoost …

Probability estimates for multi-class classification

Let us now turn to multi-class boosting. There are two main approaches: one transforms the multi-class problem into a set of binary classification problems, such as AdaBoost.MH [11], MC [12], and ECC [13]; the other combines many multi-class weak learners, such as M1 [14] and SAMME [15]. The two approaches can be formulated in a uniform prediction framework, for they all rely on the criterion that the prediction is determined solely by the highest score among all classes. Let (x_i, y_i), i = 1, …, m, be a …
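The paper's own analytical solution for pairwise coupling appears in Section 3 and is not visible in this snippet. As a point of reference only, the sketch below implements a well-known closed-form alternative, the least-squares pairwise coupling of Wu, Lin and Weng, which also reduces to solving a single linear system; it illustrates what a non-iterative coupling looks like and should not be read as the authors' formula.

```python
import numpy as np

def couple_pairwise(r):
    """Closed-form pairwise coupling (illustrative Wu-Lin-Weng least-squares
    formulation, not necessarily the analytical solution derived in Section 3).

    r[i, j] is an estimate of P(y = i | y in {i, j}, x) from the binary
    classifier trained on the pair (i, j), so r[i, j] + r[j, i] is about 1.
    Returns a class-probability vector p with sum(p) = 1.
    """
    r = np.asarray(r, dtype=float)
    k = r.shape[0]
    Q = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            if i == j:
                Q[i, i] = sum(r[s, i] ** 2 for s in range(k) if s != i)
            else:
                Q[i, j] = -r[j, i] * r[i, j]
    # KKT system for: minimize p'Qp  subject to  sum(p) = 1
    A = np.zeros((k + 1, k + 1))
    A[:k, :k] = Q
    A[:k, k] = 1.0
    A[k, :k] = 1.0
    b = np.zeros(k + 1)
    b[k] = 1.0
    return np.linalg.solve(A, b)[:k]

# consistency check: pairwise estimates generated from p = (0.5, 0.3, 0.2)
p_true = np.array([0.5, 0.3, 0.2])
r = p_true[:, None] / (p_true[:, None] + p_true[None, :])
print(couple_pairwise(r))   # recovers (0.5, 0.3, 0.2) up to rounding
```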

Prediction problem

In this section, we consider the problem of how to obtain a robust prediction. The traditional framework chooses the class with the maximum probability as the best one, once and for all. This method is based on Bayesian decision theory. Consider a practical problem: there are k teams in a basketball tournament and we want to predict the champion in a single step. This is the Bayesian decision idea. Now we turn to a new route: generally it is a difficult task to give a credible prediction of …

Experiments

In this section, we experiment on several UCI datasets to evaluate the new algorithms. In all cases, decision stumps are used as weak learners. The first experiment is designed for calibrating binary classifiers; the second and third experiments are carried out on multi-class datasets. We choose DAB and LogitBoost as the basic strong classifiers in the experiments, because they are the representative algorithms of the two objective functions (7) and (13).
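For readers who want to reproduce the flavour of the binary calibration experiment, a minimal scikit-learn sketch is given below. It is not the authors' code: the dataset is synthetic rather than one of the UCI sets, AdaBoostClassifier with its default decision-stump base learner stands in for DAB, and calibration_curve is used only to print a simple reliability table.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in for one of the UCI binary datasets used in the paper
X, y = make_classification(n_samples=4000, n_features=20, flip_y=0.1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# boosted decision stumps (scikit-learn's default base learner is a stump)
clf = AdaBoostClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
print("test accuracy:", round(clf.score(X_te, y_te), 3))

# reliability table: well-calibrated estimates lie close to the diagonal
frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=10)
for mp, fp in zip(mean_pred, frac_pos):
    print(f"predicted {mp:.2f} -> observed {fp:.2f}")
```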

Conclusions

We have proposed a probability method for solving classification problems. It can be used to estimate the posterior probabilities for two or more classes. A series of experiments shows that it helps to overcome the problems we face. We must also be careful when using AdaBoost algorithms, considering, for example, whether the algorithm needs to be calibrated and whether the trained model has overfit.

Regarding the prediction model, we focus on the sequence. The traditional model …

Conflict of interest

None declared.

Acknowledgments

This work was partially supported by the Aeronautical Science Foundation of China under Grant No. 20115169016, the General Armament Department Pre-research Foundation of China under Grant No. 9140C460302130C46173, and the Natural Science Foundation of Jiangsu Province of China under Grant No. BK20131296.

Qingfeng Nie received his B.S. degree in mathematics from Southeast University, Nanjing, China, in 2010. He is currently a Ph.D. Candidate in the School of Automation at the Southeast University. His research interests are in the areas of machine learning and data mining with emphasis on mathematical modeling.

References (28)

  • D. Mease et al., Evidence contrary to the statistical view of boosting, J. Mach. Learn. Res. (2008)

  • T. Hastie et al., Generalized additive models, Stat. Sci. (1986)

  • N. Landwehr et al., Logistic model trees, Mach. Learn. (2005)

  • R.E. Schapire et al., Improved boosting algorithms using confidence-rated prediction, Mach. Learn. (1999)

Lizuo Jin received his Ph.D. degree in Pattern Recognition and Intelligent System from Southeast University, China, in 2000. He is now an associate professor at School of Automation, Southeast University. His research interests include theory and methods for machine learning, pattern recognition, computer vision and embedded systems.

Shumin Fei received his Ph.D. degree from Beihang University, Beijing, China, in 1995. He is now a professor at School of Automation, Southeast University. His research interests are in the areas of nonlinear control system design and synthesis, hybrid system analysis, neural network control and so on.
