
1 Introduction

In the last decades, ensemble algorithms have been widely used in the Machine Learning and Data Science communities due to their remarkable results. There are both practical and theoretical reasons why ensembles are preferred over single learners. For example, it is known that a group of learners with similar training performance may have different generalization performance when exposed to sparse data, large volumes of data or data fusion. The basic idea of building an ensemble is therefore to construct an inference model by combining a set of learning hypotheses, instead of designing the complete map between inputs and responses in a single step. Ensemble-based systems have been shown to produce favorable results compared to those of single-expert systems for a broad range of applications such as financial, medical and social models, network security, web mining or bioinformatics, to name a few [3, 8].

Since the mid-nineties, boosting algorithms have been a very popular technique for constructing ensembles in the areas of Pattern Recognition and Machine Learning (see [2, 5,6,7]). Ensemble learning is the discipline that studies the use of a committee of models to construct a joint predictor which improves the performance over a single, more complex model. Boosting is a learning algorithm designed to construct such a predictor by combining what are called weak hypotheses. The AdaBoost algorithm, introduced by Freund and Schapire [5], builds an ensemble incrementally, placing increasing weights on the examples in the data set which appear to be “difficult”.

Currently, most modern computers have processors with multiple cores, whereas AdaBoost was proposed at a time when the number of cores per machine was much more limited. It therefore seems natural to use all the available resources to improve the quality of the inference made by this model. In [1] a concurrent ensemble approach (Concurrent AdaBoost) is presented that improves the weight estimation phase, which is one of the most important stages in the AdaBoost algorithm. By using concurrent computation, the authors showed that one can effectively improve the generalization ability of the algorithm. In this work, we use this approach not only to improve the estimation of the weights for the resampling stage, but also to improve the time efficiency, without sacrificing generalization performance.

This paper is organized as follows. In Sect. 2, we briefly introduce AdaBoost and some parallel AdaBoost approaches. In Sect. 3, we present our proposed model, Concurrent AdaBoost with Subsampling. In Sect. 4 we compare the performance of our proposal using different percentages of subsampled data. The last section is devoted to concluding remarks and to delineating future work.

2 Adaptive Boosting and Some Parallel Variants

The AdaBoost algorithm, introduced in 1997 by Freund and Schapire [5], has its theoretical background in the “PAC” learning model [13], whose authors were the first to pose the question of whether a weak learner that is slightly better than random guessing can be “boosted” into a strong learning algorithm. Classic AdaBoost takes as input a training set \(\mathcal{Z}=\{(x_1,y_1)\ldots (x_n,y_n)\}\), where each \(x_i\) belongs to \(\mathcal{X}\subset \mathbb R^{d}\) and each label \(y_i\) is in some label set \(\mathcal{Y}\) such as \(\{-1,1\}\). Using a set of weak learners, AdaBoost's main idea is to maintain a sampling distribution \(D_t\) over the training set, where in a sequence of \(t=1\ldots T\) rounds, \(D_t\) is used to train each weak learner. More formally, let \(D_t(i)\) be the sampling weight assigned to example i on round t. In the initial round, \(D_t(i)=\frac{1}{n}\) for all i. Then, at each round, the weights of the incorrectly classified examples are increased, so that the following weak learner is forced to focus on the “hard” examples of the training set. The job of each weak learner is to find a hypothesis \(h_t : \mathcal{X}\rightarrow \{-1,1 \}\) appropriate for the distribution \(D_t\). The goodness of the obtained hypothesis can be quantified as the weighted error:

$$\begin{aligned} \epsilon _t&= Pr_{i\sim D_t}[h_t(x_i)\ne y_i] = \sum _{i:h_t(x_i)\ne y_i} D_t(i). \end{aligned}$$
(1)
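As a concrete illustration, the weighted error of Eq. 1 is simply the total sampling weight of the misclassified examples. The following minimal NumPy sketch (with illustrative names; it is not the implementation used in this paper) computes it from a normalized weight vector:

```python
import numpy as np

def weighted_error(D, h_pred, y):
    """Eq. 1: sum of the weights D_t(i) of the examples misclassified by h_t."""
    D = np.asarray(D, dtype=float)
    return float(np.sum(D[h_pred != y]))

# Example: n = 4 examples with the uniform initial distribution D_1(i) = 1/n
D = np.full(4, 0.25)
y = np.array([1, -1, 1, -1])
h_pred = np.array([1, 1, 1, -1])      # the second example is misclassified
print(weighted_error(D, h_pred, y))   # 0.25
```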

The AdaBoost loss function is \(\ell (y,f(x))=\exp (-yf(x))\), where y is the target and f(x) is the approximation made by the model. To minimize this exponential loss using weak hypotheses \(h_t\) with values in \(\{-1,1 \}\), one must solve at each step:

$$\begin{aligned} (\alpha _t, h_t) = \arg \!\min _{\alpha ,h} \sum _{i=1}^{n} D_{i}^{(t)} \exp (-\alpha y_i h(x_i)), \end{aligned}$$
(2)

for the weak hypothesis \(h_t\) and corresponding coefficient \(\alpha _t\) to be added at each step, with \(D_{i}^{(t)} = \exp (-y_i H_{t-1} (x_i))\), where \(H_{t-1}\) is the strong hypothesis without the learner \(h_t\). The solution to Eq. 2 for \(h_t\), for any value of \(\alpha > 0\), is

$$\begin{aligned} h_t = \arg \!\min _{h} \sum _{i=1}^{n} D_{i}^{(t)}\, \mathbb {I}\left[ y_i \ne h(x_i)\right] . \end{aligned}$$
(3)
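For completeness (this closed form is not restated above, but it is the standard AdaBoost result), once \(h_t\) is fixed, minimizing Eq. 2 over \(\alpha\) gives

$$\begin{aligned} \alpha _t = \frac{1}{2}\ln \left( \frac{1-\epsilon _t}{\epsilon _t}\right) , \end{aligned}$$

where \(\epsilon _t\) is the weighted error of Eq. 1 computed with the normalized distribution; \(\alpha _t\) is the coefficient with which \(h_t\) enters the strong hypothesis \(H_t = H_{t-1} + \alpha _t h_t\).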

In general, the boosting weights can be used in two ways: reweighting (the numerical weights for each example are passed directly to the learner) and resampling (the training set is resampled following the weight distribution, creating a new training set). In [12] the authors state that the latter approach gives better results. Nevertheless, the weak learner \(h_t\) that is selected with this approach may be sensitive to the resampling technique.
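To make the resampling variant concrete, the sketch below (again NumPy-based and purely illustrative, not the implementation used in this paper) draws a new training set with replacement according to the current distribution \(D_t\); the weak learner is then trained on this resample with uniform weights:

```python
import numpy as np

def resample(X, y, D, size=None, seed=None):
    """Resampling variant: draw a new training set following the boosting
    distribution D_t, instead of passing the weights to the learner directly."""
    rng = np.random.default_rng(seed)
    n = len(y) if size is None else size
    idx = rng.choice(len(y), size=n, replace=True, p=D)
    return X[idx], y[idx]
```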

There have been several approaches combining parallel computing with AdaBoost. In [10, 11] the authors propose two parallel boosting algorithms, ADABOOST.PL and LOGITBOOST.PL, which facilitate the simultaneous participation of multiple computing nodes to construct a boosted ensemble classifier. The authors claim that these algorithms are competitive with the corresponding serial versions in terms of generalization performance while achieving a significant speedup. In [4] a randomised parallel version of AdaBoost is proposed. The algorithm uses the fact that the logarithm of the exponential loss is a function with coordinate-wise Lipschitz continuous gradient in order to define the step lengths. The authors provide a proof of convergence for this randomised AdaBoost algorithm and a theoretical parallelization speedup factor. The authors in [9] propose an algorithm called BOOM, for boosting with momentum. Namely, BOOM retains the momentum and convergence properties of the accelerated gradient method while taking into account the curvature of the objective function. They describe a distributed implementation of BOOM which is suitable for massive high-dimensional datasets. To the best of our knowledge, all proposed algorithms that use parallel computing try to improve the computation time rather than the generalization performance. The results obtained by these approaches are at most similar to the ones obtained with classic AdaBoost, because they try to approximate the behaviour of the classic algorithm.

3 Subsampling the Concurrent AdaBoost Algorithm

Concurrent computing is a form of computing in which several computations are executed during overlapping time periods, instead of sequentially (one completing before the next starts). In this research the main idea is to put all the processors of the machine to work, using otherwise idle processors to improve both the generalization ability and the execution time of the AdaBoost algorithm. In most parallel AdaBoost approaches, the multiple cores are used to decrease the computation time by partitioning the dataset into smaller fractions. These approaches try to obtain an approximation to the model that would have been obtained if it had been trained with the classic approach.

Fig. 1. Concurrent approach: p weak learners \(h^i\) per boosting round, where \(i=1,\ldots ,p\).

In this proposal, instead of using a single weak learner in each AdaBoost round, the idea is to use all p available processors to subsample the original data in a parallel fashion, and also to train several weak learners in parallel. With all p weak learners we build an ensemble, in which each learner is weighted by its training accuracy, and the output of this ensemble is then used to update the weights of the examples. This proposal is based on a previous work, where the parallel resampling was performed using the original size of the dataset [1]. In this work, we choose each \(h^{j}\), \(j=1,\ldots ,p\), according to Eq. 3, and then obtain the ensemble output \(E_t(x_i)\) for the example \(x_i\) at the t-th round as:

$$\begin{aligned} E_t (x_i)&= \text{ sign }\left( \sum _{j=1}^{p} \phi ^{j} h_{t}^{j}(x_i)\right) , \end{aligned}$$
(4)

where \(\phi ^{j}\) is the training accuracy of the weak hypothesis \(h_{t}^{j}\) and \(\sum _{j=1}^{p}\phi ^{j} h_{t}^{j}(x_i)\) is the accuracy-weighted decision of the p weak learners that were trained in parallel. Then, the weighted error \(\epsilon _t\) is computed considering the output of the ensemble \(E_t\), using \(\epsilon _t = Pr_{i\sim D_t}[E_{t}(x_i)\ne y_i]\). With this, we avoid having to select a single learner explicitly, and use an ensemble of learners instead. Algorithm 1 shows our proposal.
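As a small numerical illustration of Eq. 4, suppose \(p=3\) weak learners with training accuracies \(\phi ^{1}=0.9\), \(\phi ^{2}=0.8\) and \(\phi ^{3}=0.7\) predict \(+1\), \(-1\) and \(+1\) on some example \(x_i\); the weighted sum is \(0.9-0.8+0.7=0.8>0\), so the ensemble output is \(E_t(x_i)=+1\).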

Note that in the classic AdaBoost approach, a single weak learner is trained with a resample of the original data, and the weights of the distribution are updated using the outputs of this single weak hypothesis. In the concurrent approach (see Fig. 1), on each boosting round, p resamples obtained in a concurrent fashion are used to train p weak learners, which form an ensemble weighted by training accuracy. This ensemble is then used to update the weights of the distribution.
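To summarize the per-round procedure, the following simplified single-machine sketch uses Python's multiprocessing module and scikit-learn decision stumps. The actual implementation described in Sect. 4 uses Parallel Python, and the exact weight update of Algorithm 1 is not reproduced here, so the helper names, the subsampling fraction s and the AdaBoost-style update below are illustrative assumptions rather than the authors' code.

```python
import numpy as np
from multiprocessing import Pool
from sklearn.tree import DecisionTreeClassifier

def _train_one(args):
    """Subsample a fraction s of the data according to D_t and fit one decision stump."""
    X, y, D, s, seed = args
    rng = np.random.default_rng(seed)
    m = max(1, int(s * len(y)))                        # subsample size: s * n
    idx = rng.choice(len(y), size=m, replace=True, p=D)
    stump = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
    phi = stump.score(X[idx], y[idx])                  # training accuracy, used as phi^j
    return stump, phi

def concurrent_round(X, y, D, p=7, s=0.1, seed=0):
    """One boosting round: train p weak learners in parallel on subsamples,
    combine them by an accuracy-weighted vote (Eq. 4), and update D_t with an
    AdaBoost-style rule driven by the ensemble output E_t (illustrative choice)."""
    jobs = [(X, y, D, s, seed + j) for j in range(p)]
    with Pool(processes=p) as pool:                    # call under `if __name__ == "__main__":`
        learners = pool.map(_train_one, jobs)
    votes = sum(phi * stump.predict(X) for stump, phi in learners)
    E = np.sign(votes)                                 # ensemble output E_t(x_i), Eq. 4
    eps = float(np.sum(D[E != y]))                     # weighted error of the ensemble
    alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-12))
    D_new = D * np.exp(-alpha * y * E)                 # increase weights of misclassified examples
    return learners, alpha, D_new / D_new.sum()
```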

Algorithm 1. Concurrent AdaBoost with Subsampling.
Table 1. Test results on the Quasar/Star, Breast Cancer and Twitter datasets in terms of ROC-AUC, F1-score, Precision and Recall, with different percentages of concurrent subsampling s, using \(p=7\) parallel processes.

4 Experimentation

In this section we validate our proposal with 3 real datasets commonly used for binary classification tasks: the SDSS DR7 Quasar Catalog/SDSS SEGUE Stellar Parameter Pipeline data (Quasar/Star)Footnote 1, the well-known Breast Cancer dataset, and a set of Twitter messages used for Sentiment Analysis tasks. The Quasar/Star dataset consists of 433043 examples and 10 attributes, and the Twitter dataset contains 400 examples and 59794 attributes (this is only a subset of the entire dataset, which has 4000 examples). The latter was collected in 2011 and is related to the Chilean presidential election; the sentiment (positive or negative) of the Twitter messages was labeled by 3 journalism students.

The weak learners used were decision stumps, and 20 experiments were run for each configuration. The data was split into two groups: 80% for training and 20% for testing. We implemented our proposal in the Python language (v2.7.6) and used the latest version of the parallel processing library Parallel PythonFootnote 2. The experiments were run on an Intel i7 2.6 GHz (8 threads) with 16 GB of RAM running Ubuntu 14.04 x64. The number of parallel processes p in each experiment of our proposal was 7 (an odd number of decisions, avoiding ties in the vote). The performance measures used to report the results are: Accuracy, Area under the ROC curve (ROC-AUC), F1-score, Precision and Recall.

In Table 1 we show the results of the experiments on the 3 datasets in terms of ROC-AUC, F1-score, Precision and Recall, with subsampling percentages s equal to 1%, 5%, 10%, 50% and 100%.

Analyzing the results of our experiments, we observe that the proposed approach achieves similar classification performance using smaller resamples compared to using resamples with more data. Indeed, in some cases (the Quasar/Star dataset) the best performance is reached with the smallest subsample. This shows that our proposed algorithm is not affected considerably by reducing the size of the resamples. In Table 2 we show the execution time and the test accuracy of the experiments over all three datasets, using different subsample size percentages. Although there is a slight improvement in accuracy when the size of the resamples increases, the execution time increases greatly. The best illustration of this are the results obtained on the Quasar/Star dataset: the results for 1% and 50% are the same over the 20 experiments, but the execution time is considerably lower for 1%.

Table 2. Execution time and test accuracy with \(p=7\) parallel learners in each round.

5 Conclusions

In this work we introduce a subsampled concurrent variant of the classic AdaBoost algorithm. By using subsamples of the training data, our proposal is able to improve the efficiency in terms of execution speed, with minimal to no accuracy loss, especially on large datasets. From a previous work it is known that by training more than one weak learner per round (using concurrent computation), a significant improvement in the generalization ability of the algorithm can be obtained. Building on this result, we showed that via resampling methods we can remarkably speed up the execution of the process, making this algorithm suitable and efficient for problems involving large datasets.

In future work we will apply this approach to other boosting algorithms and to new datasets. We also plan to formalize the approach in terms of the convergence of the AdaBoost algorithm.