A novel approach for discretizing continuous attributes based on tree ensemble and moment matching optimization

  • Regular Paper
  • Published in: International Journal of Data Science and Analytics

Abstract

This paper introduces ForestDisc, an optimized, supervised, multivariate, and nonparametric discretization algorithm based on tree ensemble learning and moment matching optimization. At its core, for each continuous attribute in the data space, ForestDisc uses moment matching to select representative split points from those generated while constructing a random forest model. An extensive empirical study involving 50 benchmark datasets and six classification algorithms shows that ForestDisc is highly competitive with 20 major discretizers on both intrinsic and extrinsic performance measures. The intrinsic metrics are the number of resulting bins per variable and the execution time needed to discretize an attribute. The extrinsic metrics assess the discretizers when applied as a preprocessing step for classification tasks, and include accuracy, F1, and Kappa. ForestDisc also achieves an excellent trade-off between intrinsic and extrinsic performance.
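The core idea described above, harvesting candidate split points from a tree ensemble and condensing them into a few representative cut points by moment matching, can be sketched as follows. This is an illustrative Python sketch, not the authors' R implementation: the split-point values are invented for the example, and an exhaustive subset search stands in for the nonlinear-programming solvers used in the paper.

```python
from itertools import combinations

def moments(xs, num_moments=4):
    """First raw moments of a sample."""
    n = len(xs)
    return [sum(x ** m for x in xs) / n for m in range(1, num_moments + 1)]

def moment_match(candidates, k, num_moments=4):
    """Pick k split points whose raw moments best match those of the full
    candidate set (sum of squared moment differences). Exhaustive search is
    used here purely for illustration; it is only feasible for small inputs."""
    target = moments(candidates, num_moments)
    best, best_err = None, float("inf")
    for subset in combinations(sorted(set(candidates)), k):
        err = sum((a - b) ** 2
                  for a, b in zip(moments(list(subset), num_moments), target))
        if err < best_err:
            best, best_err = subset, err
    return list(best)

# Hypothetical split points harvested from many trees for one attribute;
# they cluster around three regions, so three cut points summarize them well.
splits = [2.1, 2.3, 2.2, 5.0, 5.1, 4.9, 5.0, 7.8, 8.0, 2.2, 7.9]
print(moment_match(splits, k=3))
```

The selected points then serve as the bin boundaries for the attribute; in the actual algorithm this optimization is solved per attribute over the split points collected from the whole forest.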



Availability of data and materials

All data used are publicly available in the UCI Machine Learning Repository and the KEEL dataset repository.

Code Availability Statement

The implementation and the computational experiments were carried out using the R language and environment for statistical computing. The code, the data files, and the result files of the benchmark reported in the article are available at https://github.com/HMAISSAE/ForestDisc_Bench.git.



Funding

Not applicable

Author information

Authors and Affiliations

Authors

Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Haddouchi Maissae. The first draft of the manuscript was written by Haddouchi Maissae and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Haddouchi Maissae.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Maissae, H., Abdelaziz, B. A novel approach for discretizing continuous attributes based on tree ensemble and moment matching optimization. Int J Data Sci Anal 14, 45–63 (2022). https://doi.org/10.1007/s41060-022-00316-1
