Abstract
This paper presents and analyzes ForestDisc, a discretization method based on tree ensembles and moment matching optimization. ForestDisc is a supervised, multivariate discretizer that transforms continuous attributes into categorical ones in two steps. First, for each continuous attribute, it extracts the set of split points learned while building a Random Forest model. It then constructs a reduced set of split points using moment matching optimization. Previous work showed that ForestDisc performs very well compared with 22 popular discretizers. This work analyzes the sensitivity of ForestDisc's performance to its tuning parameters and provides guidelines for users of the ForestDisc package.
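The two steps described in the abstract can be illustrated with a small, self-contained Python sketch. It is not the ForestDisc implementation. Several simplifications are assumptions made for illustration only: bootstrapped depth-1 trees stand in for a full Random Forest, the first four raw moments are matched (the abstract does not state which moments are used), and an exhaustive search over small subsets replaces a numerical optimizer. The function names (`harvest_splits`, `reduce_splits`, and the helpers) are hypothetical.

```python
import itertools
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(xs, ys):
    """Threshold of the best depth-1 tree on one attribute, or None if no split helps."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_t, best_score = None, gini(ys)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_t = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_t


def harvest_splits(xs, ys, n_trees, rng):
    """Step 1 (simplified): collect split points from stumps fit on bootstrap samples."""
    n = len(xs)
    splits = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        t = best_split([xs[i] for i in idx], [ys[i] for i in idx])
        if t is not None:
            splits.append(t)
    return splits


def moments(values, order):
    """Raw moments of orders 1..order."""
    n = len(values)
    return [sum(v ** p for v in values) / n for p in range(1, order + 1)]


def reduce_splits(splits, k, order=4):
    """Step 2 (simplified): choose k split points whose moments best match the full set.

    Exhaustive search is only practical for small candidate sets; it is a
    stand-in for the optimizer, not the method used by the package.
    """
    target = moments(splits, order)
    candidates = sorted(set(splits))
    if len(candidates) <= k:
        return candidates

    def loss(subset):
        return sum((a - b) ** 2 for a, b in zip(moments(subset, order), target))

    return list(min(itertools.combinations(candidates, k), key=loss))


# Toy attribute whose class changes twice along its range.
rng = random.Random(0)
xs = [float(v) for v in range(20)]
ys = [1 if 7 <= v < 14 else 0 for v in range(20)]
all_splits = harvest_splits(xs, ys, n_trees=50, rng=rng)
cut_points = reduce_splits(all_splits, k=2)
print(len(all_splits), "harvested split points reduced to", cut_points)
```

In this sketch the number of trees (`n_trees`) and the number of retained cut points (`k`) play the role of tuning parameters. Rerunning it with different values gives a rough feel for the kind of sensitivity question the paper studies, although the package's actual parameters and defaults may differ.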
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Haddouchi, M., Berrado, A. (2022). Tuning ForestDisc Hyperparameters: A Sensitivity Analysis. In: Dorronsoro, B., Pavone, M., Nakib, A., Talbi, EG. (eds) Optimization and Learning. OLA 2022. Communications in Computer and Information Science, vol 1684. Springer, Cham. https://doi.org/10.1007/978-3-031-22039-5_3
DOI: https://doi.org/10.1007/978-3-031-22039-5_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22038-8
Online ISBN: 978-3-031-22039-5
eBook Packages: Computer Science (R0)