Abstract
The growing amount of available information, and its distributed and heterogeneous nature, has a major impact on the field of data mining. In this paper, we propose a framework of parallel and distributed boosting algorithms for the efficient integration of specialized classifiers learned over very large, distributed, and possibly heterogeneous databases that cannot fit into main memory. Boosting is a popular technique for constructing highly accurate classifier ensembles, in which the classifiers are trained serially and the weights on the training instances are adaptively set according to the performance of previous classifiers. Our parallel boosting algorithm is designed for tightly coupled shared-memory systems with a small number of processors, with the objective of reaching maximal prediction accuracy in fewer iterations than boosting on a single processor. At each boosting round, all processors learn classifiers in parallel, and these classifiers are then combined according to the confidence of their predictions. Our distributed boosting algorithm is proposed primarily for learning from several disjoint data sites when the data cannot be merged, although it can also be used for parallel learning, where a massive data set is partitioned into several disjoint subsets for more efficient analysis. At each boosting round, the proposed method combines classifiers from all sites and creates a classifier ensemble on each site; the final classifier is constructed as an ensemble of all the classifier ensembles built on the disjoint data sets.
Experiments on several data sets show that parallel boosting can reach the same or even better prediction accuracy considerably faster than standard sequential boosting. The experiments also indicate that distributed boosting achieves classification accuracy comparable to, or slightly better than, standard boosting, while requiring much less memory and computational time since each site works with a smaller data set.
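To make the serial reweighting mechanism mentioned above concrete, here is a minimal sketch of a standard AdaBoost loop with decision stumps (in the spirit of Freund and Schapire, 1996). This is a generic illustration only, not the authors' parallel or distributed variant; the toy one-dimensional data set and the stump learner are assumptions made for the example.

```python
import math

def train_stump(X, y, w):
    """Return the single-feature threshold stump with lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for sign in (1, -1):
                pred = [sign if x[f] >= thr else -sign for x in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best  # (weighted_error, feature, threshold, sign)

def stump_predict(stump, x):
    _, f, thr, sign = stump
    return sign if x[f] >= thr else -sign

def adaboost(X, y, rounds=10):
    n = len(X)
    w = [1.0 / n] * n            # start with uniform instance weights
    ensemble = []                # list of (alpha, stump) pairs
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = max(stump[0], 1e-10)
        if err >= 0.5:           # weak learner no better than chance
            break
        alpha = 0.5 * math.log((1 - err) / err)
        # Adaptive reweighting: misclassified instances gain weight,
        # correctly classified ones lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(s, x) for a, s in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data: positives above 0.5.
X = [[0.1], [0.2], [0.3], [0.6], [0.7], [0.9]]
y = [-1, -1, -1, 1, 1, 1]
ens = adaboost(X, y, rounds=5)
print([predict(ens, xi) for xi in X])  # -> [-1, -1, -1, 1, 1, 1]
```

The paper's parallel variant would run several such weak learners per round and weight their votes by prediction confidence, while the distributed variant would exchange the learned classifiers (rather than the raw data) between sites.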
Cite this article
Lazarevic, A., Obradovic, Z. Boosting Algorithms for Parallel and Distributed Learning. Distributed and Parallel Databases 11, 203–229 (2002). https://doi.org/10.1023/A:1013992203485