CLUB-DRF: A Clustering Approach to Extreme Pruning of Random Forests

Fawagreh, Khaled; Gaber, Mohamed Medhat; Elyan, Eyad

doi:10.1007/978-3-319-25032-8_4

Included in the following conference series:

International Conference on Innovative Techniques and Applications of Artificial Intelligence

Abstract

Random Forest (RF) is an ensemble supervised machine learning technique that was developed by Breiman over a decade ago. It has proved superior to other ensemble techniques. Many researchers, however, believe that there is still room for improving its accuracy. This explains why, over the past decade, there have been many extensions of RF, each employing a variety of techniques and strategies to improve certain aspects of RF. Since it has been shown empirically that ensembles tend to yield better results when there is significant diversity among the constituent models, the objective of this paper is twofold. First, it investigates how data clustering (a well-known diversity technique) can be applied to identify groups of similar decision trees in an RF, so that redundant trees can be eliminated by selecting a representative from each group (cluster). Second, these likely diverse representatives are used to produce an extension of RF, termed CLUB-DRF, that is much smaller than RF and yet performs at least as well as RF, and in most cases exhibits higher accuracy. This second step is an instance of a known technique called ensemble pruning. Experimental results on 15 real datasets from the UCI repository demonstrate the superiority of our proposed extension over the traditional RF. Most of our experiments achieved a pruning level of at least 92 % while retaining or outperforming the RF accuracy.
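The two steps described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, how a tree is represented, or how a representative is chosen, so the following choices are assumptions made only for illustration: each tree is represented by its vector of predicted labels on a labelled sample, trees are grouped with a simple k-modes-style procedure under Hamming distance, and the most accurate tree in each cluster is kept. All function names and the toy data are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two label vectors disagree."""
    return sum(x != y for x, y in zip(a, b))


def mode_vector(vectors):
    """Column-wise most common label across a non-empty list of vectors."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*vectors)]


def cluster_trees(pred_vectors, k, max_iter=20):
    """Group trees by their predictions with a simple k-modes-style loop.

    Seeding with the first k vectors is an assumption chosen to keep the
    sketch deterministic.
    """
    modes = [list(v) for v in pred_vectors[:k]]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: hamming(v, modes[c]))
            for v in pred_vectors
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the modes will not change
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(pred_vectors, assignment) if a == c]
            if members:  # keep the previous mode if a cluster is empty
                modes[c] = mode_vector(members)
    return assignment


def select_representatives(pred_vectors, labels, assignment):
    """Return, per cluster, the index of the tree most accurate on labels."""
    best = {}
    for idx, (vec, cluster) in enumerate(zip(pred_vectors, assignment)):
        acc = sum(p == y for p, y in zip(vec, labels)) / len(labels)
        if cluster not in best or acc > best[cluster][0]:
            best[cluster] = (acc, idx)
    return sorted(idx for _, idx in best.values())


def pruning_level(n_original, n_kept):
    """Fraction of trees removed from the original forest."""
    return 1 - n_kept / n_original


# Toy data: predictions of four hypothetical trees on six labelled samples.
labels = [0, 1, 0, 1, 0, 1]
tree_predictions = [
    [0, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 1],
]

assignment = cluster_trees(tree_predictions, k=2)
kept = select_representatives(tree_predictions, labels, assignment)
print(assignment, kept, pruning_level(len(tree_predictions), len(kept)))
```

In this toy run, trees with similar prediction vectors fall into the same cluster and one tree per cluster survives. Under the same definition of pruning level, a hypothetical forest of 500 trees reduced to 40 representatives would correspond to a pruning level of 92 %.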


Author information

Authors and Affiliations

School of Computing Science and Digital Media, Robert Gordon University, Garthdee Road, Aberdeen, AB10 7GJ, UK
Khaled Fawagreh, Mohamed Medhat Gaber & Eyad Elyan

Corresponding author

Correspondence to Mohamed Medhat Gaber.

Editor information

Editors and Affiliations

School of Computing, University of Portsmouth, Portsmouth, United Kingdom
Max Bramer
School of Computing, Engineering and Mathematics, University of Brighton, Brighton, United Kingdom
Miltos Petridis

About this paper

Cite this paper

Fawagreh, K., Gaber, M.M., Elyan, E. (2015). CLUB-DRF: A Clustering Approach to Extreme Pruning of Random Forests. In: Bramer, M., Petridis, M. (eds) Research and Development in Intelligent Systems XXXII. SGAI 2015. Springer, Cham. https://doi.org/10.1007/978-3-319-25032-8_4

DOI: https://doi.org/10.1007/978-3-319-25032-8_4
Published: 12 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25030-4
Online ISBN: 978-3-319-25032-8
eBook Packages: Computer Science, Computer Science (R0)
