Predicting corporate financial distress based on integration of decision tree classification and logistic regression

doi:10.1016/j.eswa.2011.02.173

Expert Systems with Applications

Volume 38, Issue 9, September 2011, Pages 11261-11272

https://doi.org/10.1016/j.eswa.2011.02.173

Abstract

Stock and derivative securities markets around the world continue to evolve rapidly. As markets develop, the operating status of enterprises is disclosed periodically in financial statements. Unfortunately, if a firm’s executives intentionally dress up the financial statements, no sign of possible financial distress may be observable in either the short or the long run. Recent years have seen many financial crises in international markets, such as the Enron, Kmart, Global Crossing, WorldCom and Lehman Brothers events. How these events affect business worldwide, especially the financial services industry and investors, has become a matter of public concern. To improve the accuracy of financial distress prediction models, this paper followed the operating rules of the Taiwan Stock Exchange Corporation (TSEC) and collected 100 listed companies as the initial sample. The empirical experiment used a total of 37 ratios, comprising financial and non-financial ratios, and applied principal component analysis (PCA) to extract suitable variables. Decision tree (DT) classification methods (C5.0, CART, and CHAID) and logistic regression (LR) techniques were used to implement the financial distress prediction model. The experiments produced satisfactory results, which support the feasibility and validity of the proposed methods for predicting the financial distress of listed companies.

This paper makes four critical contributions: (1) the more PCA we used, the lower the accuracy obtained by the DT classification approach, whereas PCA had no significant impact on the LR approach; (2) the closer we get to the actual occurrence of financial distress, the higher the accuracy of the DT classification approach, with 97.01% correctly classified 2 seasons prior to the occurrence of financial distress; (3) our empirical results show that PCA increases the error of classifying companies that are in a financial crisis as normal companies; and (4) the DT classification approach obtains better prediction accuracy than the LR approach in the short run (less than one year), whereas the LR approach obtains better prediction accuracy in the long run (more than one and a half years). Therefore, this paper proposes that the artificial intelligence (AI) approach could be a more suitable methodology than traditional statistics for predicting the potential financial distress of a company in the short run.

Highlights

► Our empirical results show that PCA increases the error of classifying companies that are in a financial crisis as normal companies.
► The Decision Tree classification approach obtains better prediction accuracy in the short run.
► The Logistic Regression approach obtains better prediction accuracy in the long run.

Introduction

Recently, some of the most prominent business news has concerned a series of financial crises at public companies. Some of these companies were famous and had originally traded at high stock prices (e.g. Enron Corp., Kmart Corp., WorldCom Corp., Lehman Brothers Bank, etc.). When such a crisis strikes, it is usually too late for creditors to withdraw their loans or for investors to sell their stocks, futures, or options. Corporate bankruptcy is therefore a very important economic phenomenon that affects the economy of every country. In Taiwan, domestic and foreign capital markets have developed rapidly in recent years, gradually encouraging people to make financial investments. Nevertheless, the Procomp Corp. and Cdbank Corp. bankruptcies caused tremendous disorder in Taiwan’s financial market, and related industries were also affected by these economic shocks. The number of bankrupt firms matters for a country’s economy and can be viewed as an indicator of the development and robustness of that economy (Zopounidis & Dimitras, 1998). The high individual, economic, and social costs of corporate failures and bankruptcies have spurred the search for better understanding and prediction capability (McKee & Lensberg, 2002). Forecasting corporate financial distress therefore plays an increasingly important role in today’s society, since it has a significant impact on lending decisions and on the profitability of financial institutions.

A common approach to bankruptcy prediction is to survey the literature for a large set of potentially predictive financial and/or non-financial variables and then, through traditional mathematical analysis, eliminate the non-significant variables to leave a set that will predict bankruptcy (Lensberg, Eilifsen, & McKee, 2006). Many traditional classification techniques have been proposed for predicting financial distress from ratios, e.g., univariate approaches (Beaver, 1966), multivariate approaches, linear multiple discriminant analysis (MDA) (Altman, 1968; Altman et al., 1977), multiple regression (Meyer & Pifer, 1970), logistic regression (Dimitras, Zanakis, & Zopounidis, 1996), factor analysis (Blum, 1974), and stepwise methods (Laitinen & Laitinen, 2000). However, the strict assumptions of traditional statistics, such as linearity, normality, independence among predictor variables, and a pre-existing functional form relating the criterion variable to the predictor variables, limit their application in the real world (Hua, Wang, Xu, Zhang, & Liang, 2007).

Therefore, this paper proposes a financial distress prediction model that compares decision tree (DT) classification and logistic regression (LR) techniques. The main objectives of this paper are to (1) adopt DT and LR techniques to construct a financial distress prediction model, (2) use financial and non-financial ratios to enhance the accuracy of the financial distress prediction model, (3) employ a traditional statistical method (principal component analysis, PCA) to compare the degree of accuracy with that of the artificial intelligence (AI) approach, and (4) expand this model so that it can work within a financial distress prediction system that provides information to investors as well as investment monitoring organizations. The data for our experiment were collected from the Taiwan Stock Exchange Corporation (TSEC) database.
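As a rough illustration of the two model families compared here, the following Python sketch fits both to a single ratio. It is not the authors' implementation: the paper used C5.0, CART, and CHAID, whereas this shows only one Gini-impurity split in the style of CART, alongside a one-variable logistic regression fitted by gradient descent. The `DEBT_RATIO` and `DISTRESSED` values are invented toy data, and all function names are hypothetical.

```python
import math
from typing import List, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(xs: List[float], ys: List[int]) -> Tuple[float, float]:
    """Return (threshold, weighted Gini) of the best single split x <= threshold."""
    values = sorted(set(xs))
    best = (values[0], float("inf"))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs: List[float], ys: List[int],
                 lr: float = 0.5, epochs: int = 2000) -> Tuple[float, float]:
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by batch gradient descent on log-loss."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


# Invented toy data: one hypothetical ratio per firm, 1 = distressed, 0 = normal.
DEBT_RATIO = [0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
DISTRESSED = [0, 0, 0, 0, 1, 0, 1, 1]

threshold, impurity = best_split(DEBT_RATIO, DISTRESSED)
b0, b1 = fit_logistic(DEBT_RATIO, DISTRESSED)
print(f"tree split: ratio <= {threshold:.3f} (weighted Gini {impurity:.4f})")
print(f"logistic:   P(distress | ratio=0.9) = {sigmoid(b0 + b1 * 0.9):.3f}")
```

The tree produces a hard rule on the ratio, while the logistic model produces a smooth probability. That difference is one reason the two approaches can rank differently at different prediction horizons.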

The rest of this paper is organized as follows. Section 2 reviews the literature on related techniques. Section 3 describes our proposed approach and the capabilities of each step. Section 4 presents the process for choosing appropriate variables by PCA. Section 5 analyzes the prediction performance of our approach through several experiments. Section 6 compares the results of the DT and LR approaches. Finally, Section 7 draws conclusions and discusses future research.

Section snippets

Decision trees algorithm

Data mining (DM), also known as “knowledge discovery in databases” (KDD), is the process of discovering meaningful patterns in huge databases (Han & Kamber, 2001). It is also an application that can provide significant competitive advantages for making the right decision (Huang, Chen, & Lee, 2007). The more common model functions in the current data mining process include classification, regression, clustering, association rules, summarization, dependency modeling and sequence

Research methodology

In this research, we compare the financial distress prediction (FDP) performance of DT and LR techniques. The research methodology is shown in Fig. 1. In the FDP Choosing phase, the original large datasets from the TSEC undergo data pre-processing, which includes cleaning, normalization, transformation, feature extraction and selection. The product of data pre-processing is the final training and testing set. The goal in this phase is to choose the

Data

Our samples contained raw data from 100 Taiwanese firms listed on the TSEC. The sampling period ran from January 2000 to May 2007, amounting to 7 years and 5 months. The 50 firms in financial distress were matched with 50 non-bankrupt firms. These firms were classified as non-bankrupt based on the absence of any indication or evidence of financial distress in the auditors’ reports. All the variables used in the sample were extracted from formal financial statements,

DT experiments and results

This process uses the financial and non-financial ratios and constructs a financial distress prediction model after carrying out a second round of factor analysis. The variables are then loaded as DT and LR input nodes. In addition, we apply these experimental parameters to the 2, 4, 6, and 8 seasons before the financial distress occurred, in order to examine prediction accuracy. In this experiment, we will use the C5.0, CART, CHAID as

The FDP comparing phase

After implementing the FDP modeling phase, we compare the DT and LR approaches in terms of accuracy rate, Type II error rate, and factor analysis. Detailed descriptions are given in the following sections.
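The two rates used in this comparison can be computed from a confusion matrix, as in the minimal Python sketch below. It assumes that the Type II error is the share of distressed firms classified as normal, which matches the misclassification described in the Highlights; the preview text does not state the definition explicitly. The function name `evaluate` and the label encoding (1 = distressed, 0 = normal) are hypothetical.

```python
from typing import Dict, Sequence


def evaluate(actual: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Accuracy and Type II error for labels where 1 = distressed, 0 = normal.

    Assumption: Type II error = distressed firms predicted as normal,
    divided by all distressed firms.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # distressed, caught
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # normal, cleared
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # distressed, missed

    total = len(pairs)
    distressed = tp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "type_ii_error": fn / distressed if distressed else 0.0,
    }


# Invented example: 4 distressed and 4 normal firms, one error of each kind.
metrics = evaluate([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1])
print(metrics)
```

Reporting the Type II error separately from accuracy matters in a matched sample like this one, because a missed distressed firm is usually costlier to a lender or investor than a false alarm.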

Conclusions

This research focused on the financial and non-financial ratios in financial statements and used DT and LR models to compare financial distress prediction performance, in order to find a better early-warning method. It took 50 companies that were facing a financial crisis and matched them with 50 normal companies in similar industries. In addition, we obtained the necessary dataset from the TSEC database and sampled it into the past 2, 4, 6, 8 seasons

Acknowledgements

We thank the National Science Council (NSC) of the Republic of China (ROC) for supporting this work under Grant No. NSC 96-2416-H-018-011. We also gratefully acknowledge the Editor and anonymous reviewers for their valuable comments and constructive suggestions.

References (36)

E.I. Altman et al. (1977). A new model to identify bankruptcy risk of corporations. Journal of Banking and Finance.
G.S. Atsalakis et al. (2009). Forecasting stock market short-term trends using a neuro-fuzzy based methodology. Expert Systems with Applications.
H.A. Camdeviren et al. (2007). Comparison of logistic regression model and classification tree: An application to postpartum depression data. Expert Systems with Applications.
C.L. Chang et al. (2009). Applying decision tree and neural network to increase quality of dermatologic diagnosis. Expert Systems with Applications.
A.I. Dimitras et al. (1996). A survey of business failure with an emphasis on prediction methods and industrial applications. European Journal of Operational Research.
Z. Hua et al. (2007). Predicting corporate financial distress based on integration of support vector machine and logistic regression. Expert Systems with Applications.
M.J. Huang et al. (2007). Integrating data mining with case-based reasoning for chronic diseases prognosis and diagnosis. Expert Systems with Applications.
W. Kinney et al. (1989). Characteristics of firms correcting previously reported quarterly earnings. Journal of Accounting and Economics.
E. Kirkos et al. (2007). Data mining techniques for the detection of fraudulent financial statements. Expert Systems with Applications.
E.K. Laitinen et al. (2000). Bankruptcy prediction application of the Taylor’s expansion in logistic regression. International Review of Financial Analysis.
T. Lensberg et al. (2006). Bankruptcy theory development and classification via genetic programming. European Journal of Operational Research.
H. Li et al. (2009). Majority voting combination of multiple case-based reasoning for financial distress prediction. Expert Systems with Applications.
T.E. McKee et al. (2002). Genetic programming and rough sets: A hybrid approach to bankruptcy classification. European Journal of Operational Research.
P. Xidonas et al. (2009). On the selection of equity securities: An expert systems methodology and an application on the Athens Stock Exchange. Expert Systems with Applications.
E.I. Altman (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance.
W. Beaver (1966). Financial ratios as predictors of failure, empirical research in accounting: Selected studies. Journal of Accounting Research.
M. Blum (1974). Failing company discriminant analysis. Journal of Accounting Research.
L. Breiman et al. (1984). Classification and regression trees.

