Comparison of the decision tree, artificial neural network, and linear regression methods based on the number and types of independent variables and sample size

https://doi.org/10.1016/j.eswa.2006.12.017

Abstract

In this article, the performance of data mining and statistical techniques was empirically compared while varying the number of independent variables, the types of independent variables, the number of classes of the categorical independent variables, and the sample size. Our study employed 60 simulated examples, with artificial neural networks and decision trees as the data mining techniques and linear regression as the statistical method. Using the RMSE value as the performance metric, we found the following: (i) for continuous independent variables, the statistical technique (i.e., linear regression) was superior to the data mining techniques (i.e., decision tree and artificial neural network) regardless of the number of variables and the sample size; (ii) for mixed continuous and categorical independent variables, linear regression was best when the number of categorical variables was one, while the artificial neural network was superior when the number of categorical variables was two or more; (iii) the performance of the artificial neural network improved faster than that of the other methods as the number of classes of the categorical variables increased.

Introduction

The difficulties posed by prediction problems have resulted in a variety of problem-solving techniques. For example, data mining methods comprise artificial neural networks and decision trees, and statistical techniques include linear regression and stepwise polynomial regression. It is difficult, however, to compare the efficacy of the techniques and determine the best one because their performance is data-dependent.

A few studies have compared data mining and statistical approaches to solving prediction problems. Gorr, Nagin, and Szczypula (1994) compared linear regression, stepwise polynomial regression, and neural networks in the context of predicting student GPAs. Although they found that linear regression performed best overall, none of the methods performed significantly better than the ordering index used by the investigator. Shuhui, Wunsch, Hair, and Giesselmann (2001) reported that neural networks performed better than linear regression for wind farm data, while Hardgrave, Wilson, and Walstrom (1994) experimentally showed that neural networks did not significantly outperform statistical techniques in predicting the academic success of students entering an MBA program. Subbanarasimha, Arinze, and Anadarajan (2000) demonstrated that linear regression performed better than neural networks when the distribution of the dependent variable was skewed, and Kumar (2005) expanded on Subbanarasimha et al.'s (2000) result, developing a hybrid method that improved the prediction accuracy.

These comparison studies have mainly considered a specific data set or the distribution of the dependent variable. Other unexplored criteria, however, affect the performance of decision problem techniques, such as sample size and characteristics of the independent variables. We empirically compared the performance of data mining and statistical techniques while varying the number of independent variables, the types of independent variables, the number of classes of the independent variables, and the sample size. Our study employed 60 simulated examples, with artificial neural networks and decision trees as the data mining techniques, and linear regression as the statistical method.

In addition to these general comparison results, we used the RMSE value as the metric and determined the following: for continuous independent variables, the statistical technique (i.e., linear regression) was superior to the data mining techniques (i.e., decision tree and artificial neural network) regardless of the number of variables; for mixed continuous and categorical independent variables, linear regression was best when the number of categorical variables was one, while the artificial neural network was superior when the number of categorical variables was two or more; and the performance of the artificial neural network improved faster than that of the other methods as the number of classes of the categorical variables increased.
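Comparing linear regression and neural networks on categorical inputs presumes those inputs are first encoded numerically; the article does not state the encoding used, but one-hot encoding is a common choice. The sketch below is an illustration under that assumption, with arbitrary sizes (two continuous variables, one three-class categorical variable) chosen only for demonstration:

```python
import numpy as np

def one_hot(labels, n_classes):
    """One-hot encode an integer-coded categorical variable."""
    out = np.zeros((len(labels), n_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

rng = np.random.default_rng(0)
X_cont = rng.uniform(0.0, 1.0, size=(5, 2))   # two continuous variables
cat = rng.integers(0, 3, size=5)              # one 3-class categorical variable
# combined design matrix: continuous columns followed by one-hot columns
X = np.hstack([X_cont, one_hot(cat, 3)])       # shape (5, 5)
```

As the number of categorical variables or their classes grows, the one-hot block widens, which is one plausible reason the relative behavior of the methods changes with those factors.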

The article is organized as follows. Section 2 illustrates the generation of the data sets and analysis methods for the empirical study. The experimental results are described in Section 3, and the conclusions and future research directions are presented in Section 4.

Section snippets

Data generation

In this section, we describe the 60 simulated prediction problems that we generated to evaluate the performance of the decision tree, neural network, and linear regression techniques. First, Table 1 shows 12 simulated examples with continuous independent variables.

These 12 examples were obtained from the linear model, where xi was randomly selected in the range [0, 1], and ε was normally distributed with mean 0 and standard deviation 1. The number of independent variables was set to one, three,
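A minimal sketch of this data-generation scheme is given below. The uniform [0, 1] inputs and standard normal noise follow the description above; the unit coefficients and the particular values of `n_samples` and `n_vars` are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def generate_example(n_samples, n_vars, seed=0):
    """Simulate one regression example: x_i ~ Uniform[0, 1],
    eps ~ N(0, 1), y = sum_i x_i + eps (unit coefficients assumed)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, n_vars))
    coefs = np.ones(n_vars)                    # assumed, not from the paper
    eps = rng.normal(0.0, 1.0, size=n_samples)
    y = X @ coefs + eps
    return X, y

X, y = generate_example(n_samples=100, n_vars=3)
```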

Experimental evaluation

Computational results (i.e., RMSE values) for the LR, ANN, and DT methods are summarized in Table 4, Table 5, and Table 6, broken down by prediction method (M), sample size (S), number of independent variables (V), number of categorical variables (CA), and number of classes of the categorical variables (CL).
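The comparison itself can be sketched with standard scikit-learn estimators: fit each method on simulated data and score it by RMSE on a held-out split. The data here reuse the linear model described earlier; the network size, train/test split, and sample size are illustrative assumptions, not the paper's experimental settings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# simulated example: x_i ~ Uniform[0, 1], eps ~ N(0, 1)
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 3))
y = X.sum(axis=1) + rng.normal(0.0, 1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "LR": LinearRegression(),
    "ANN": MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0),
    "DT": DecisionTreeRegressor(random_state=0),
}
rmse = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse[name] = float(np.sqrt(mean_squared_error(y_te, pred)))
print(rmse)
```

Running such a grid over sample sizes, variable counts, and variable types yields tables of RMSE values of the kind the article reports.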

Table 4 shows the RMSE values for LR, ANN, and DT when the independent variables were continuous. The LR and ANN prediction methods performed consistently better than DT. Furthermore, in almost all cases considered, LR was

Conclusions

In this article, we present the results of an experimental comparison study of data mining and statistical techniques based on varying the number of independent variables, the types of independent variables, the number of classes of the independent variables, and the sample size. To evaluate the performance of the different techniques, we generated various simulated problems and used the RMSE metric.

The main results include the following: when independent variables are continuous, LR is

