Improving financial bankruptcy prediction in a highly imbalanced class distribution using oversampling and ensemble learning: a case from the Spanish market

  • Regular Paper
Progress in Artificial Intelligence

Abstract

Bankruptcy is one of the most critical financial problems, as it reflects a company’s failure. From a machine learning perspective, bankruptcy prediction is challenging mainly because the class distribution in the datasets is highly imbalanced. Developing an efficient prediction model that can detect a company’s risky situation is therefore a complex task. To tackle this problem, we propose a hybrid approach that combines the synthetic minority oversampling technique (SMOTE) with ensemble methods. Moreover, we apply five different feature selection methods to identify the most dominant attributes for bankruptcy prediction. The proposed approach is evaluated on a real dataset collected from Spanish companies. The experiments show promising results, which suggest that the proposed approach can be used as an efficient alternative for highly imbalanced datasets.
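To make the oversampling step concrete, the following is a minimal sketch of the interpolation idea behind the synthetic minority oversampling technique. It is not the authors' implementation: the function name `smote`, the toy feature values, and the choice to pick base samples at random are assumptions made for illustration, and the ensemble training that would follow on the rebalanced data is not shown.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (simplified SMOTE)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        # Assumption: the base sample is drawn at random rather than cycled through.
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base sample (Euclidean distance).
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: two made-up financial ratios per company.
bankrupt = [[0.10, 0.90], [0.15, 0.80], [0.20, 0.85]]   # minority class
solvent = [[0.70 + 0.01 * i, 0.20] for i in range(30)]  # majority class

new_points = smote(bankrupt, n_synthetic=len(solvent) - len(bankrupt), k=2)
balanced_minority = bankrupt + new_points
print(len(balanced_minority), len(solvent))
```

Each synthetic point lies on the line segment between two real minority samples, so no new region of the feature space is invented. In practice, oversampling is usually applied only to the training split, so that evaluation still reflects the original class distribution.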

Notes

  1. Bought from http://infotel.es.

Acknowledgements

This work has been partially funded by projects TIN2017-85727-C4-2-P, RTI2018-102002-A-I00 (Spanish Ministry of Science, Innovation and Universities) and TEC2015-68752 (Spanish Ministry of Economy and Competitiveness + FEDER).

Author information

Corresponding author

Correspondence to Ibrahim Aljarah.

Appendix

Note that the numbers in brackets in the following tables indicate standard deviations. The names of the ensemble approaches in the tables are reported as ‘Ensemble technique/base learner (best number of iterations)’ (Tables 5, 6, 7 and Figs. 8, 9, 10).

Table 5 Results of bankruptcy prediction without re-sampling
Table 6 Results of bankruptcy prediction with re-sampling
Table 7 AB-Rep tree with re-sampling based on top selected attributes (feature selection)
Fig. 8 AUC results of Experiment I (without performing oversampling or feature selection)

Fig. 9 AUC results of Experiment II (with oversampling but without feature selection)

Fig. 10 AUC results of Experiment III (with oversampling and feature selection)
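As a reminder of what the AUC values in these figures measure, the sketch below computes AUC as the probability that a randomly chosen positive (bankrupt) case receives a higher score than a randomly chosen negative (solvent) case, with ties counted as one half. The function name `auc` and the toy scores are assumptions for illustration; they are not taken from the paper's experiments.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count positive/negative pairs ranked correctly; a tie counts as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores (higher means "more likely bankrupt").
scores = [0.9, 0.4, 0.35, 0.8, 0.1]
labels = [1, 0, 1, 0, 0]
print(round(auc(scores, labels), 4))
```

Because AUC depends only on how positives are ranked relative to negatives, it is less sensitive to class imbalance than plain accuracy, which a classifier could inflate by always predicting the majority class.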

Cite this article

Faris, H., Abukhurma, R., Almanaseer, W. et al. Improving financial bankruptcy prediction in a highly imbalanced class distribution using oversampling and ensemble learning: a case from the Spanish market. Prog Artif Intell 9, 31–53 (2020). https://doi.org/10.1007/s13748-019-00197-9

  • DOI: https://doi.org/10.1007/s13748-019-00197-9