
1 Introduction

In the last decades, ensemble algorithms have been widely used in the Machine Learning and Data Science communities due to their remarkable results. There are both practical and theoretical reasons why ensembles are preferred over single learners. For example, it is known that a group of learners with similar training performance may have different generalization performance when exposed to sparse data, large volumes of data or data fusion. The basic idea of building an ensemble is therefore to construct an inference model by combining a set of learning hypotheses, instead of designing the complete map between inputs and responses in a single step. Ensemble-based systems have been shown to produce favorable results compared to those of single-expert systems for a broad range of applications such as financial, medical and social models, network security, web mining or bioinformatics, to name a few [3, 8].

Since the mid-nineties, boosting algorithms have been a very popular technique for constructing ensembles in the areas of Pattern Recognition and Machine Learning (see [2, 5,6,7]). Ensemble learning is the discipline that studies the use of a committee of models to construct a joint predictor which improves the performance over a single, more complex model. Boosting is a learning algorithm designed to construct such a predictor by combining what are called weak hypotheses. The AdaBoost algorithm, introduced by Freund and Schapire [5], builds an ensemble incrementally, placing increasing weights on the examples in the data set which appear to be “difficult”.

Currently, most modern computers have processors with multiple cores, whereas AdaBoost was proposed at a time when the number of cores per machine was much more limited. It therefore seems natural to use all the available resources to improve the quality of the inference made by this model. In [1] a concurrent ensemble approach (Concurrent AdaBoost) is presented that improves the weight estimation phase, which is one of the most important stages in the AdaBoost algorithm. By using concurrent computation, the authors showed that one can effectively improve the generalization ability of the algorithm. In this work, we use this approach not only to improve the estimation of the weights for the resampling stage, but also to improve the time efficiency, without sacrificing generalization performance.

This paper is organized as follows. In Sect. 2, we briefly introduce AdaBoost and some parallel AdaBoost approaches. In Sect. 3, we present our proposed model, Concurrent AdaBoost with Subsampling. In Sect. 4 we compare the performance of our proposal using different percentages of subsampled data. The last section is devoted to concluding remarks and to delineating future work.

2 Adaptive Boosting and Some Parallel Variants

The AdaBoost algorithm, introduced in 1997 by Freund and Schapire [5], has its theoretical background in the “PAC” learning model [13], whose authors were the first to pose the question of whether a weak learner that is slightly better than random guessing can be “boosted” into a strong learning algorithm. Classic AdaBoost takes as input a training set \(\mathcal{Z}=\{(x_1,y_1)\ldots (x_n,y_n)\}\), where each \(x_i\) belongs to \(\mathcal{X}\subset \mathbb R^{d}\) and each label \(y_i\) is in some label set \(\mathcal{Y}\) such as \(\{-1,1\}\). Using a set of weak learners, AdaBoost's main idea is to maintain a sampling distribution \(D_t\) over the training set, where in a sequence of \(t=1\ldots T\) rounds, \(D_t\) is used to train each weak learner. More formally, let \(D_t(i)\) be the sampling weight assigned to example i on round t. In the initial round, \(D_t(i)=\frac{1}{n}\) for all i. Then, at each round, the weights of the incorrectly classified examples are increased, so that the following weak learner is forced to focus on the “hard” examples of the training set. The job of each weak learner is to find a hypothesis \(h_t : \mathcal{X}\rightarrow \{-1,1 \}\) appropriate for the distribution \(D_t\). The goodness of the obtained hypothesis can be quantified as the weighted error:

$$\begin{aligned} \epsilon _t&= Pr_{i\sim D_t}[h_t(x_i)\ne y_i] = \sum _{i:h_t(x_i)\ne y_i} D_t(i). \end{aligned}$$
(1)
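As a concrete illustration, the weighted error of Eq. 1 is simply the total sampling weight of the misclassified examples. The following minimal NumPy sketch (with illustrative names; it is not the implementation used in this paper) computes it from a normalized weight vector:

```python
import numpy as np

def weighted_error(D, h_pred, y):
    """Eq. 1: sum of the weights D_t(i) of the examples misclassified by h_t."""
    D = np.asarray(D, dtype=float)
    return float(np.sum(D[h_pred != y]))

# Example: n = 4 examples with the uniform initial distribution D_1(i) = 1/n
D = np.full(4, 0.25)
y = np.array([1, -1, 1, -1])
h_pred = np.array([1, 1, 1, -1])      # the second example is misclassified
print(weighted_error(D, h_pred, y))   # 0.25
```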

The AdaBoost loss function is \(\ell (y,f(x))=\exp (-yf(x))\), where y is the target and f(x) is the approximation made by the model. To minimize this exponential loss using weak hypotheses \(h_t\) with values in \(\{-1,1 \}\), one must solve at each step:

$$\begin{aligned} (\alpha _t, h_t) = \arg \!\min _{\alpha ,h} \sum _{i=1}^{n} D_{i}^{(t)} \exp (-\alpha y_i h(x_i)), \end{aligned}$$
(2)

for the weak hypothesis \(h_t\) and corresponding coefficient \(\alpha _t\) to be added at each step, with \(D_{i}^{(t)} = \exp (-y_i H_{t-1} (x_i))\), where \(H_{t-1}\) is the strong hypothesis without the learner \(h_t\). The solution to Eq. 2 for \(h_t\), for any value of \(\alpha > 0\), is

$$\begin{aligned} h_t = \arg \!\min _{h} \sum _{i=1}^{n} D_{i}^{(t)}\, \mathbb {I}\left[ y_i \ne h(x_i)\right] . \end{aligned}$$
(3)
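For completeness (this closed form is not restated above, but it is the standard AdaBoost result), once \(h_t\) is fixed, minimizing Eq. 2 over \(\alpha\) gives

$$\begin{aligned} \alpha _t = \frac{1}{2}\ln \left( \frac{1-\epsilon _t}{\epsilon _t}\right) , \end{aligned}$$

where \(\epsilon _t\) is the weighted error of Eq. 1 computed with the normalized distribution; \(\alpha _t\) is the coefficient with which \(h_t\) enters the strong hypothesis \(H_t = H_{t-1} + \alpha _t h_t\).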

In general, the boosting weights can be used in two ways: reweighting (the numerical weights for each example are passed directly to the learner) and resampling (the training set is resampled following the weight distribution, creating a new training set). In [12] the authors state that the latter approach gives better results. Nevertheless, the weak learner \(h_t\) that is selected with this approach may be sensitive to the resampling technique.
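To make the resampling variant concrete, the sketch below (again NumPy-based and purely illustrative, not the implementation used in this paper) draws a new training set with replacement according to the current distribution \(D_t\); the weak learner is then trained on this resample with uniform weights:

```python
import numpy as np

def resample(X, y, D, size=None, seed=None):
    """Resampling variant: draw a new training set following the boosting
    distribution D_t, instead of passing the weights to the learner directly."""
    rng = np.random.default_rng(seed)
    n = len(y) if size is None else size
    idx = rng.choice(len(y), size=n, replace=True, p=D)
    return X[idx], y[idx]
```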

There have been several approaches combining parallel computing with AdaBoost. In [10, 11] the authors propose two parallel boosting algorithms, ADABOOST.PL and LOGITBOOST.PL, which facilitate the simultaneous participation of multiple computing nodes to construct a boosted ensemble classifier. The authors claim that these algorithms are competitive with the corresponding serial versions in terms of generalization performance while achieving a significant speedup. In [4] a randomised parallel version of AdaBoost is proposed. The algorithm uses the fact that the logarithm of the exponential loss is a function with coordinate-wise Lipschitz continuous gradient in order to define the step lengths. The authors provide a proof of convergence for this randomised AdaBoost algorithm and a theoretical parallelization speedup factor. The authors in [9] propose an algorithm called BOOM, for boosting with momentum. Namely, BOOM retains the momentum and convergence properties of the accelerated gradient method while taking into account the curvature of the objective function. They describe a distributed implementation of BOOM which is suitable for massive high-dimensional datasets. To the best of our knowledge, all proposed algorithms that use parallel computing try to improve the computation time rather than the generalization performance. The results obtained by these approaches are at most similar to the ones obtained with classic AdaBoost, because they try to approximate the behaviour of the classic algorithm.

3 Subsampling the Concurrent AdaBoost Algorithm

Concurrent computing is a form of computing in which several computations are executed during overlapping time periods, instead of sequentially (one completing before the next starts). In this research the main idea is to put all the processors of the machine to work, using otherwise idle processors to improve both the generalization ability and the execution time of the AdaBoost algorithm. In most parallel AdaBoost approaches, the multiple cores are used to decrease the computation time by partitioning the dataset into smaller fractions. These approaches try to obtain an approximation to the model that would have been obtained if it had been trained with the classic approach.

Fig. 1. Concurrent approach: p weak learners \(h^i\) per boosting round, where \(i=1,\ldots ,p\).

In this proposal, instead of using a single weak learner in each AdaBoost round, the idea is to use all p available processors to subsample the original data in a parallel fashion, and also to train several weak learners in parallel. With all p weak learners we build an ensemble, in which each learner is weighted by its training accuracy, and the output of this ensemble is then used to update the weights of the examples. This proposal is based on a previous work, where the parallel resampling was performed using the original size of the dataset [1]. In this work, we choose each \(h^{j}\), \(j=1,\ldots ,p\), according to Eq. 3, and then obtain the ensemble output \(E_t(x_i)\) for the example \(x_i\) at the t-th round as:

$$\begin{aligned} E_t (x_i)&= \text{ sign }\left( \sum _{j=1}^{p} \phi ^{j} h_{t}^{j}(x_i)\right) , \end{aligned}$$
(4)

where \(\phi ^{j}\) is the training accuracy of the weak hypothesis \(h_{t}^{j}\) and \(\sum _{j=1}^{p}\phi ^{j} h_{t}^{j}(x_i)\) is the accuracy-weighted decision of the p weak learners that were trained in parallel. Then, the weighted error \(\epsilon _t\) is computed considering the output of the ensemble \(E_t\), using \(\epsilon _t = Pr_{i\sim D_t}[E_{t}(x_i)\ne y_i]\). With this, we avoid having to select a single learner explicitly, and use an ensemble of learners instead. Algorithm 1 shows our proposal.
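As a small numerical illustration of Eq. 4, suppose \(p=3\) weak learners with training accuracies \(\phi ^{1}=0.9\), \(\phi ^{2}=0.8\) and \(\phi ^{3}=0.7\) predict \(+1\), \(-1\) and \(+1\) on some example \(x_i\); the weighted sum is \(0.9-0.8+0.7=0.8>0\), so the ensemble output is \(E_t(x_i)=+1\).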

Note that in the classic AdaBoost approach, a single weak learner is trained with a resample of the original data, and the weights of the distribution are updated using the outputs of this single weak hypothesis. In the concurrent approach (see Fig. 1), on each boosting round, p resamples obtained in a concurrent fashion are used to train p weak learners, which form an ensemble weighted by training accuracy. This ensemble is then used to update the weights of the distribution.
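To summarize the per-round procedure, the following simplified single-machine sketch uses Python's multiprocessing module and scikit-learn decision stumps. The actual implementation described in Sect. 4 uses Parallel Python, and the exact weight update of Algorithm 1 is not reproduced here, so the helper names, the subsampling fraction s and the AdaBoost-style update below are illustrative assumptions rather than the authors' code.

```python
import numpy as np
from multiprocessing import Pool
from sklearn.tree import DecisionTreeClassifier

def _train_one(args):
    """Subsample a fraction s of the data according to D_t and fit one decision stump."""
    X, y, D, s, seed = args
    rng = np.random.default_rng(seed)
    m = max(1, int(s * len(y)))                        # subsample size: s * n
    idx = rng.choice(len(y), size=m, replace=True, p=D)
    stump = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
    phi = stump.score(X[idx], y[idx])                  # training accuracy, used as phi^j
    return stump, phi

def concurrent_round(X, y, D, p=7, s=0.1, seed=0):
    """One boosting round: train p weak learners in parallel on subsamples,
    combine them by an accuracy-weighted vote (Eq. 4), and update D_t with an
    AdaBoost-style rule driven by the ensemble output E_t (illustrative choice)."""
    jobs = [(X, y, D, s, seed + j) for j in range(p)]
    with Pool(processes=p) as pool:                    # call under `if __name__ == "__main__":`
        learners = pool.map(_train_one, jobs)
    votes = sum(phi * stump.predict(X) for stump, phi in learners)
    E = np.sign(votes)                                 # ensemble output E_t(x_i), Eq. 4
    eps = float(np.sum(D[E != y]))                     # weighted error of the ensemble
    alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-12))
    D_new = D * np.exp(-alpha * y * E)                 # increase weights of misclassified examples
    return learners, alpha, D_new / D_new.sum()
```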

Algorithm 1. Concurrent AdaBoost with Subsampling.
Table 1. Test results on the Quasar/Star, Breast Cancer and Twitter datasets in terms of ROC-AUC, F1-score, Precision and Recall, with different percentages of concurrent subsampling s, using \(p=7\) parallel processes.

4 Experimentation

In this section we validate our proposal with 3 real datasets commonly used for binary classification tasks: the SDSS DR7 Quasar Catalog/SDSS SEGUE Stellar Parameter Pipeline data (Quasar/Star)Footnote 1, the well-known Breast Cancer dataset, and a set of Twitter messages used for Sentiment Analysis tasks. The Quasar/Star dataset consists of 433043 examples and 10 attributes, and the Twitter dataset contains 400 examples and 59794 attributes (this is only a subset of the entire dataset, which has 4000 examples). The latter was collected in 2011 and is related to the Chilean presidential election; the sentiment (positive or negative) of the Twitter messages was labeled by 3 journalism students.

The weak learners used were decision stumps, and 20 experiments were run for each configuration. The data was split into two groups: 80% for training and 20% for testing. We implemented our proposal in the Python language (v2.7.6) and used the latest version of the parallel processing library Parallel PythonFootnote 2. The experiments were run on an Intel i7 2.6 GHz (8 threads) with 16 GB of RAM running Ubuntu 14.04 x64. The number of parallel processes p in each experiment of our proposal was 7 (an odd number of decisions, avoiding ties in the vote). The performance measures used to report the results are: Accuracy, Area under the ROC curve (ROC-AUC), F1-score, Precision and Recall.

In Table 1 we show the results of the experiments on the 3 datasets in terms of ROC-AUC, F1-score, Precision and Recall, with subsampling percentages s equal to 1%, 5%, 10%, 50% and 100%.

Analyzing the results of our experiments, we observe that the proposed approach achieves similar classification performance using smaller resamples compared to using resamples with more data. Indeed, in some cases (the Quasar/Star dataset) the best performance is reached with the smallest subsample. This shows that our proposed algorithm is not affected considerably by reducing the size of the resamples. In Table 2 we show the execution time and the test accuracy of the experiments over all three datasets, using different subsample size percentages. Although there is a slight improvement in accuracy when the size of the resamples increases, the execution time increases greatly. The best illustration of this are the results obtained on the Quasar/Star dataset: the results for 1% and 50% are the same over the 20 experiments, but the execution time is considerably lower for 1%.

Table 2. Execution time and test accuracy with \(p=7\) parallel learners in each round.

5 Conclusions

In this work we introduce a subsampled concurrent variant of the classic AdaBoost algorithm. By using subsamples of the training data, our proposal is able to improve the efficiency in terms of execution speed, with minimal to no accuracy loss, especially on large datasets. From a previous work it is known that by training more than one weak learner per round (using concurrent computation), a significant improvement in the generalization ability of the algorithm can be obtained. Building on this result, we showed that via resampling methods we can remarkably speed up the execution of the process, making this algorithm suitable and efficient for problems involving large datasets.

In future work we will apply this approach to other boosting algorithms and to new datasets. We also plan to formalize the approach in terms of the convergence of the AdaBoost algorithm.