An experimental study on diversity for bagging and boosting with linear classifiers
Introduction
A classifier is any function D: R^n → Ω by which we assign a class label ω from a set of predefined labels Ω = {ω1,…,ωc} to an object represented as a data point x in the real n-dimensional space R^n. In the general case, the classifier output is a c-dimensional vector [d1(x),…,dc(x)]^T, where di(x) is the degree of “support” given by classifier D to the hypothesis that x comes from class ωi, i=1,…,c. Without loss of generality we can restrict di(x) within the interval [0,1], and call the classifier outputs “soft labels”. Most often di(x) is an estimate of the posterior probability P(ωi|x). In some cases, “crisp” class labels are required, i.e., di(x)∈{0,1} and ∑i di(x)=1. These can be obtained by “hardening” the soft labels by assigning the largest value to 1 (the winning class label), and the remaining values to 0. Ties are resolved arbitrarily.
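The hardening step amounts to an argmax followed by one-hot encoding; the sketch below is a minimal illustration (the function name and the use of NumPy are ours, not from the paper):

```python
import numpy as np

def harden(soft_labels):
    """Turn soft labels d1(x),...,dc(x) into crisp 0/1 labels.

    The largest support is set to 1 (the winning class), all others to 0.
    np.argmax picks the first maximum, so ties are resolved arbitrarily
    (by position), as in the text.
    """
    soft = np.asarray(soft_labels, dtype=float)
    crisp = np.zeros_like(soft)
    crisp[np.argmax(soft)] = 1.0
    return crisp

print(harden([0.2, 0.7, 0.1]))  # [0. 1. 0.]
```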
Classifier combination aims at a higher accuracy than that of a single D. The literature on classifier combination highlights the necessity of measuring and using the degree of diversity, independence, orthogonality, complementarity, etc., which are intuitively desirable characteristics of a classifier team [5], [13], [16], [23], [29], [34]. Theoretically, a group of independent classifiers will improve upon the single classifier when majority vote combination is used. A dependent set of classifiers may be either better or worse [22]: sometimes the dependence is beneficial to the ensemble, and sometimes it is harmful. There is no consensus on what a “good” measure of diversity should be. The conceptual difficulty in defining diversity can be illustrated by an example. Assume that we have tested the L classifiers forming an ensemble on a data set of N=100 (N>L⩾3) objects (data points); each classifier recognizes all but one of the data points; and each classifier fails on a different point. Thus the estimated individual accuracy of each classifier is 0.99. Obviously, the classifier outputs are highly related, as a large number of coincident decisions occur: the decisions of every pair of classifiers coincide in 98 out of 100 cases. Intuitively, this means that the diversity is low, and there is no gain in combining the classifiers. From another point of view, however, if we combine the classifier outputs, e.g., by taking the majority vote, we will arrive at a correct decision in all 100 cases. Thus the small remaining improvement of 1% over the individual accuracy can be achieved by combining these “not too diverse” classifiers. So the potential for improvement is small, but it is all that is needed in this case. If we want diversity to measure the potential for improvement, what should its value be for this example, high or low? To account for this variety of viewpoints, in our experiments we use nine measures of diversity.
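To make the example concrete, the following few lines reproduce it numerically (the setup comes from the text; the NumPy code is only an illustration):

```python
import numpy as np

L, N = 3, 100  # N > L >= 3, as in the example

# correct[i, j] = 1 if classifier i labels object j correctly.
# Each classifier fails on exactly one object, and all failures differ.
correct = np.ones((L, N), dtype=int)
for i in range(L):
    correct[i, i] = 0

print(correct.mean(axis=1))                  # individual accuracies: [0.99 0.99 0.99]
print((correct.sum(axis=0) > L / 2).mean())  # majority-vote accuracy: 1.0
```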
Bagging, Boosting, Arcing and the Random subspace method are guidelines for constructing classifier ensembles by varying the inputs. In this study we chose Bagging and Boosting, which have shown good performance on various data sets [1], [6].
Once the ensemble is put together, different combination methods can be used to derive the final class label of an object from the individual classifier outputs. In this study we used eight simple methods: minimum, maximum, product, average, simple majority, weighted majority, Naive Bayes and decision templates. These were selected with the idea of exploring the potential of the ensemble beyond the traditional simple majority vote for Bagging and weighted majority vote for Boosting. We were interested in whether the diversity of the ensembles constructed by Bagging and Boosting would exhibit a relationship with the accuracy of some of the combination methods.
The rest of the paper is organized as follows. Section 2 explains Bagging and Boosting. Section 3 introduces the nine measures of diversity and the eight combination methods. The experiments are described in Section 4. Section 5 offers our conclusions.
Section snippets
Bagging
Bagging and Boosting are strategies for creating classifier ensembles, similar in concept, yet with fundamental differences [9]. Bagging was proposed by Breiman [2] and extended further to Arcing [3], [4] to accommodate the adaptive incremental construction of the ensemble which underlies the Boosting method (explained later). Bagging creates the classifiers in the ensemble by taking random samples with replacement (bootstrap sampling [7]) from the data set and building one classifier on
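As an illustration of the resampling loop, here is a minimal sketch in Python, assuming scikit-learn's LogisticRegression as a stand-in linear base classifier (the paper uses other linear models; the function and parameter names here are ours):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # stand-in linear classifier

def bagging_ensemble(X, y, L=10, seed=None):
    """Build L classifiers, each trained on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    ensemble = []
    for _ in range(L):
        idx = rng.integers(0, n, size=n)  # sample n indices with replacement
        ensemble.append(LogisticRegression().fit(X[idx], y[idx]))
    return ensemble

def majority_vote(ensemble, X):
    """Combine crisp 0/1 labels by simple majority (two-class case)."""
    votes = np.stack([clf.predict(X) for clf in ensemble])  # shape (L, n_samples)
    return (votes.mean(axis=0) > 0.5).astype(int)
```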
Combination methods
Let D = {D1,…,DL} be the set of trained classifiers and Ω = {ω1,…,ωc} be the set of class labels. Denote by di,j(x) the support given by classifier Di for the hypothesis that the given input x comes from class ωj, i=1,…,L, j=1,…,c. The L classifier outputs are then combined to get a label for x. Depending on the type of the classifier outputs and the combination rule, we can get a soft final output [μ1(x),…,μc(x)]^T or a crisp one ω∈Ω.
Here we consider eight combination
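The snippet is cut off here, but four of the simple rules listed in the introduction (minimum, maximum, product, average) follow directly from the di,j(x) notation above. The sketch below is our illustrative reading, not the authors' implementation:

```python
import numpy as np

def combine(supports, rule="average"):
    """Fuse an (L, c) support matrix, supports[i, j] = di,j(x), into a crisp label.

    The chosen rule aggregates the supports for each class over the L
    classifiers; argmax then picks the winning class index.
    """
    agg = {
        "minimum": supports.min(axis=0),
        "maximum": supports.max(axis=0),
        "product": supports.prod(axis=0),
        "average": supports.mean(axis=0),
    }[rule]
    return int(np.argmax(agg))

# L = 3 classifiers, c = 2 classes
d = np.array([[0.6, 0.4],
              [0.3, 0.7],
              [0.8, 0.2]])
print(combine(d, "average"))  # 0 (the first class wins with mean support ~0.57)
```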
Measures of diversity
Diversity may be interpreted differently, as suggested in the introduction. Hence, there are different diversity measures in the literature. Some of these measures, such as the Q-statistic and the correlation coefficient, have come directly from mainstream statistics; others have their origins in software engineering and the comparison of software versions; and yet another group of measures has been proposed specifically for the problems of multiple classifier systems.
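For concreteness, here is a sketch of one of these measures, the pairwise Q-statistic, computed from the 2×2 table of coincident correct/incorrect decisions (the standard definition Q = (N11·N00 − N01·N10)/(N11·N00 + N01·N10), applied to the 100-object example from the introduction):

```python
import numpy as np

def q_statistic(correct_i, correct_k):
    """Pairwise Q-statistic from two 0/1 vectors of correct decisions.

    N11: both correct, N00: both wrong, N10/N01: exactly one correct.
    Q is 0 for independent classifiers, positive when the two tend to
    fail on the same objects, negative when they fail on different ones.
    """
    a = np.asarray(correct_i, dtype=bool)
    b = np.asarray(correct_k, dtype=bool)
    n11 = np.sum(a & b)
    n00 = np.sum(~a & ~b)
    n10 = np.sum(a & ~b)
    n01 = np.sum(~a & b)
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

# Two classifiers from the introduction's example: each fails on one
# (different) object out of 100.
c1 = np.ones(100, dtype=int); c1[0] = 0
c2 = np.ones(100, dtype=int); c2[1] = 0
print(q_statistic(c1, c2))  # -1.0
```

Note that this pair gets Q = −1, the strongest possible negative dependence, even though the two classifiers agree on 98 of 100 objects; this is exactly the kind of ambiguity the introduction describes.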
Experiments
We used the nine measures of diversity with the eight combination methods and the two ensemble building strategies: Bagging and Boosting. The experimental setup is described below. Table 5 contains a description of the seven data sets used.
1. 80-D correlated Gaussian data. This is an 80-dimensional data set consisting of two Gaussian classes with equal covariance matrices; 500 vectors sampled from each class. The mean of the first class is zero for all the features. The mean of the second class is
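The description is cut off before the second mean is given, so the sketch below reproduces only what the text states (80 dimensions, two Gaussian classes with equal covariance matrices, 500 vectors per class, zero mean for the first class); mu2 and the covariance matrix are placeholders, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(0)
n_dim, n_per_class = 80, 500  # stated in the text

mu1 = np.zeros(n_dim)      # class 1: zero mean for all features (as stated)
mu2 = np.full(n_dim, 0.1)  # PLACEHOLDER: the true mean is cut off above
cov = np.eye(n_dim)        # PLACEHOLDER: the paper's (correlated) covariance,
                           # shared by both classes, is not reproduced here

X = np.vstack([rng.multivariate_normal(mu1, cov, size=n_per_class),
               rng.multivariate_normal(mu2, cov, size=n_per_class)])
y = np.repeat([0, 1], n_per_class)
```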
Discussion
The purpose of the experiment was to allow us to spot visually any pattern between diversity and accuracy that can guide our further studies. This is why we grouped the diversity measures first by their type ((↓) and (↑)) and then by their symmetry characteristic.
The first conspicuous observation from the scatterplots 1–4 is that there is no strong relationship between diversity and accuracy, be it linear or non-linear. There is, however, a general trend shown by the Boosted ensembles
Conclusions
This paper explores the relationship between diversity and accuracy in a large-scale experiment. Bagging and Boosting were used to generate ensembles of two models of linear base classifiers. Different sizes of the ensembles and the training data sets were considered using seven two-class data sets. Eight combination methods were applied with similar accuracies, which led us to pool all the results so that we could plot and analyze a general “accuracy” versus different measures
References (37)
- Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences (1997)
- G. Giacinto, F. Roli, Design of effective neural network ensembles for image classification processes, Image and Vision Computing (2001)
- L.I. Kuncheva, Using measures of similarity and inclusion for multiple classifier fusion by decision templates, Fuzzy Sets and Systems (2001)
- L.I. Kuncheva, J.C. Bezdek, R.P.W. Duin, Decision templates for multiple classifier fusion: an experimental comparison, Pattern Recognition (2001)
- Y. Liu, X. Yao, Ensemble learning via negative correlation, Neural Networks (1999)
- D. Partridge, W.J. Krzanowski, Software diversity: practical statistics for its measurement and exploitation, Information and Software Technology (1997)
- E. Bauer, R. Kohavi, An empirical comparison of voting classification algorithms: Bagging, boosting, and variants, Machine Learning (1999)
- L. Breiman, Bagging predictors, Machine Learning (1996)
- L. Breiman, Arcing classifiers, The Annals of Statistics (1998)
- L. Breiman, Combining predictors