Information Fusion

Volume 3, Issue 4, December 2002, Pages 245-258

An experimental study on diversity for bagging and boosting with linear classifiers

https://doi.org/10.1016/S1566-2535(02)00093-3

Abstract

In classifier combination, it is believed that diverse ensembles have a better potential for improving accuracy than non-diverse ensembles. We put this hypothesis to a test for two methods for building the ensembles: Bagging and Boosting, with two linear classifier models: the nearest mean classifier and the pseudo-Fisher linear discriminant classifier. To estimate diversity, we apply nine measures proposed in the recent literature on combining classifiers. Eight combination methods were used: minimum, maximum, product, average, simple majority, weighted majority, Naive Bayes and decision templates. We carried out experiments on seven data sets for different sample sizes, different numbers of classifiers in the ensembles, and the two linear classifiers. Altogether, we created 1364 ensembles by the Bagging method and the same number by the Boosting method. On each of these, we calculated the nine measures of diversity and the accuracy of the eight different combination methods, averaged over 50 runs. The results confirmed in a quantitative way the intuitive explanation behind the success of Boosting for linear classifiers for increasing training sizes, and the poor performance of Bagging in this case. Diversity measures indicated that Boosting succeeds in inducing diversity even for stable classifiers whereas Bagging does not.

Introduction

A classifier is any function D by which we assign a class label ω from a set of predefined labels Ω={ω1,…,ωc} to an object represented as a data point x in a real n-dimensional space Rn. In the general case, the classifier output is a c-dimensional vector [d1(x),…,dc(x)]T where di(x) is the degree of “support” given by classifier D to the hypothesis that x comes from class ωi, i=1,…,c. Without loss of generality we can restrict di(x) to the interval [0,1], and call the classifier outputs “soft labels”. Most often di(x) is an estimate of the posterior probability P(ωi|x). In some cases, “crisp” class labels are required, i.e., di(x)∈{0,1} and ∑i=1c di(x)=1. These can be obtained by “hardening” the soft labels: the largest value is set to 1 (the winning class label) and the remaining values to 0. Ties are resolved arbitrarily.
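
For illustration, a minimal sketch (Python with NumPy assumed; not part of the original paper) of how soft labels can be hardened into crisp labels:

    import numpy as np

    def harden(soft_outputs):
        """Turn soft supports d_1(x), ..., d_c(x) into crisp 0/1 labels.

        The largest support becomes 1 (the winning class label) and the rest 0;
        np.argmax breaks ties by taking the first maximum, i.e. arbitrarily.
        """
        soft_outputs = np.asarray(soft_outputs, dtype=float)
        crisp = np.zeros_like(soft_outputs)
        crisp[np.argmax(soft_outputs)] = 1.0
        return crisp

    print(harden([0.2, 0.5, 0.3]))   # -> [0. 1. 0.]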

Classifier combination aims at a higher accuracy than that of a single D. The literature on classifier combination highlights the necessity of measuring and using the degree of diversity, independence, orthogonality, complementarity, etc., which are intuitively desirable characteristics of a classifier team [5], [13], [16], [23], [29], [34]. Theoretically, a group of independent classifiers will improve upon the single classifier when majority vote combination is used. A dependent set of classifiers may be either better or worse [22]. Sometimes the dependence is beneficial to the ensemble and sometimes it is harmful. There is no consensus on what a “good” measure of diversity should be. The conceptual difficulty in defining diversity can be illustrated by an example. Assume that we have tested the L classifiers forming an ensemble on a data set of N=100 (N>L⩾3) objects (data points); each classifier recognizes all but one of the data points; and each classifier fails on a different point. Thus the estimated individual accuracy of each classifier is 0.99. Obviously, the classifier outputs are highly related, as a large number of coincident decisions occur: the decisions of every pair of classifiers coincide in 98 out of 100 cases. Intuitively, this means that the diversity is low, and there is no gain in combining the classifiers. From another point of view, however, if we combine the classifier outputs, e.g., by taking the majority vote, we will arrive at a correct decision in all 100 cases. Thus the small remaining improvement of 1% over the individual accuracy can be achieved through combining these “not too diverse” classifiers. So the potential for improvement is small, but this is all that is needed in this case. If we want diversity to measure the potential for improvement, what should its value be for this example, high or low? To account for this variety of viewpoints, in our experiments we use nine measures of diversity.
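
The arithmetic behind this example can be checked with a short sketch (Python/NumPy assumed; L=3 and the failure pattern are the hypothetical values from the example above):

    import numpy as np

    # Hypothetical setup from the example: L classifiers, N = 100 objects,
    # classifier i misclassifies only object i (1 = correct, 0 = wrong).
    L, N = 3, 100
    correct = np.ones((L, N), dtype=int)
    for i in range(L):
        correct[i, i] = 0

    individual_accuracy = correct.mean(axis=1)         # 0.99 for every classifier
    majority_is_correct = correct.sum(axis=0) > L / 2  # strict majority of correct votes
    print(individual_accuracy, majority_is_correct.mean())   # [0.99 0.99 0.99] 1.0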

Bagging, Boosting, Arcing and the Random subspace method are guidelines for constructing classifier ensembles by varying the inputs. In this study we chose Bagging and Boosting, which have shown good performance on various data sets [1], [6].

Once the ensemble is put together, different combination methods can be used to derive the final class label of an object from the individual classifier outputs. In this study we used eight simple methods: minimum, maximum, product, average, simple majority, weighted majority, Naive Bayes and decision templates. These were selected with the idea of exploring the potential of the ensemble beyond the traditional simple majority voting for Bagging and weighted majority voting for Boosting. We were interested in whether diversity of the ensembles constructed by Bagging and Boosting would exhibit a relationship with the accuracy of some of the combination methods.

The rest of the paper is organized as follows. Section 2 explains Bagging and Boosting. Section 3 introduces the nine measures of diversity and the eight combination methods. The experiments are described in Section 4. Section 5 offers our conclusions.


Bagging

Bagging and Boosting are strategies for creating classifier ensembles, similar in concept yet with fundamental differences [9]. Bagging was proposed by Breiman [2] and extended further to Arcing [3], [4] to accommodate the adaptive incremental construction of the ensemble which underlies the Boosting method (explained later). Bagging creates the classifiers in the ensemble by taking random samples with replacement (bootstrap sampling [7]) from the data set and building one classifier on
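
A rough sketch of the procedure described so far (assuming scikit-learn, with NearestCentroid standing in for the nearest mean classifier; the function names are illustrative, not the paper's):

    import numpy as np
    from sklearn.neighbors import NearestCentroid

    def bagging_ensemble(X, y, L=10, seed=0):
        """Train L nearest mean classifiers, each on a bootstrap sample of (X, y)."""
        rng = np.random.default_rng(seed)
        n = len(y)
        ensemble = []
        for _ in range(L):
            idx = rng.integers(0, n, size=n)   # sampling with replacement
            ensemble.append(NearestCentroid().fit(X[idx], y[idx]))
        return ensemble

    def majority_vote(ensemble, X):
        """Combine crisp predictions by simple majority (plurality) vote.

        Assumes integer class labels 0, 1, ..., c-1.
        """
        votes = np.stack([clf.predict(X) for clf in ensemble])
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)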

Combination methods

Let D={D1,D2,…,DL} be the set of trained classifiers and Ω={ω1,…,ωc} be the set of class labels. Denote by di,j(x) the support given by classifier Di for the hypothesis that the given input x comes from class ωj, i=1,…,L, j=1,…,c. The L classifier outputs D1(x),…, DL(x) are then combined to get a label for x. Depending on the type of the classifier outputs and the combination rule, we can get a soft final output D(x)=[μ1(x),…,μc(x)]T or a crisp one D(x)∈Ω.
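A minimal sketch of the fixed soft-output rules among these (Python/NumPy assumed; DP denotes the L×c matrix of supports di,j(x) for a single object x):

    import numpy as np

    def combine(DP, rule="average"):
        """Fuse the columns of the L x c decision profile DP into mu_1(x), ..., mu_c(x)."""
        rules = {
            "minimum": DP.min(axis=0),
            "maximum": DP.max(axis=0),
            "product": DP.prod(axis=0),
            "average": DP.mean(axis=0),
        }
        return rules[rule]

    DP = np.array([[0.6, 0.4],    # outputs of D_1
                   [0.7, 0.3],    # outputs of D_2
                   [0.2, 0.8]])   # outputs of D_3
    for rule in ("minimum", "maximum", "product", "average"):
        mu = combine(DP, rule)
        print(rule, mu, "-> label omega_%d" % (mu.argmax() + 1))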

Here we consider eight combination

Measures of diversity

Diversity may be interpreted differently, as suggested in the introduction. Hence, there are different diversity measures in the literature. Some of these measures, such as the Q-statistic and the correlation coefficient, have come directly from mainstream statistics; others have their origins in software engineering and the comparison of software versions; and yet another group of measures has been proposed specifically for the problems of multiple classifier systems.
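
As an example of the pairwise measures mentioned here, a minimal sketch of the Q-statistic (Python/NumPy assumed; averaging over all classifier pairs is one common convention and not necessarily the paper's exact recipe):

    import numpy as np
    from itertools import combinations

    def q_statistic(ci, cj):
        """Yule's Q for two classifiers, from their 0/1 correctness vectors on N objects."""
        n11 = np.sum((ci == 1) & (cj == 1))   # both correct
        n00 = np.sum((ci == 0) & (cj == 0))   # both wrong
        n10 = np.sum((ci == 1) & (cj == 0))
        n01 = np.sum((ci == 0) & (cj == 1))
        # Q is near 1 when the classifiers tend to err on the same objects (low diversity)
        # and near 0 when their errors look independent. Assumes a non-degenerate table,
        # i.e. the denominator below is non-zero.
        return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

    def average_q(correct):
        """Average pairwise Q over an L x N matrix of correctness outcomes."""
        return np.mean([q_statistic(correct[i], correct[j])
                        for i, j in combinations(range(len(correct)), 2)])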

Experiments

We used the nine measures of diversity with the eight combination methods and the two ensemble-building strategies: Bagging and Boosting. The experimental setup is described below. Table 5 contains a description of the seven data sets used.

  • 1. 80-D correlated Gaussian data. This is an 80-dimensional data set consisting of two Gaussian classes with equal covariance matrices; 500 vectors sampled from each class. The mean of the first class is zero for all the features. The mean of the second class is

Discussion

The purpose of the experiment was to allow us to spot visually any pattern between diversity and accuracy that could guide our further studies. This is why we grouped the diversity measures first by their type ((↓) and (↑)) and then by the symmetry characteristic.

The first conspicuous observation from scatterplots 1–4 is that there is no strong relationship between diversity and accuracy, whether linear or non-linear. There is, however, a general trend shown by the Boosted ensembles

Conclusions

This paper explores the relationship between diversity and accuracy in a large-scale experiment. Bagging and Boosting were chosen to generate ensembles of two models of linear base classifiers. Different sizes of the ensembles and the training data sets were considered using seven two-class data sets. Eight combination methods were applied with similar accuracies, which led us to pool all the results so that we could plot and analyze a general “accuracy” versus different measures

References (37)

  • P. Cunningham, J. Carney, Diversity versus quality in classification ensembles based on feature selection, Technical...
  • T.G. Dietterich, Ensemble methods in machine learning
  • B. Efron et al., An Introduction to the Bootstrap (1993)
  • Y. Freund et al., Discussion of the paper “Arcing Classifiers” by Leo Breiman, The Annals of Statistics (1998)
  • S.W. Golomb et al., The search for Hadamard matrices, American Mathematical Monthly (1963)
  • L.K. Hansen et al., Neural network ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence (1990)
  • S. Hashem, B. Schmeiser, Y. Yih, Optimal linear combinations of neural networks: an overview, in: IEEE International...
  • T.K. Ho, The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence (1998)