A hybrid decision tree algorithm for mixed numeric and categorical data in regression analysis
Introduction
One of the primary tasks in data mining is regression analysis, whose objective is to identify the relationships between a dependent variable and one or more independent variables; it has been used widely for prediction and forecasting in many real-world problems. In many cases, datasets contain both numerical and categorical independent variables. Many regression algorithms, such as linear regression, support vector regression, and neural networks, are well defined and validated for numeric variables, because it is straightforward to model the relations between a target and its predictors when both are numeric. In contrast, categorical variables describe non-numeric, qualitative attributes of data, and numeric operations cannot be performed on them.
One way to convert categorical data to numeric data is to use a coding system, such as dummy coding, effects coding, or contrast coding, to manage the presence of categorical predictors [8], [16]. Through these coding systems, qualitative variables representing categories or group memberships are converted into quantitative variables. Another approach is to define similarity or dissimilarity measures between categorical and numeric variables. Distance-based regression algorithms have utilized this approach by defining appropriate distance metrics [5], [10], [11]. Once similarity or dissimilarity functions are defined on both categorical and numeric variables, the original input data matrix can be transformed into a distance matrix, and any distance-based regression algorithm can be applied to the transformed matrix. In addition, distance metrics are used to extract neighboring samples for k-nearest neighbor regression and non-parametric kernel regression.
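As a minimal illustration of the coding approach, the sketch below applies dummy (one-hot) coding to a toy mixed dataset with pandas; the column names and data are purely illustrative, not taken from the paper's datasets:

```python
import pandas as pd

# Toy mixed dataset: one categorical and one numeric predictor (names are illustrative).
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],   # categorical
    "size": [1.0, 2.5, 3.0, 2.0],                # numeric
})

# Dummy coding: each category becomes an indicator column, dropping the
# first level to avoid perfect collinearity with an intercept term.
coded = pd.get_dummies(df, columns=["color"], drop_first=True)
print(sorted(coded.columns))  # ['color_green', 'color_red', 'size']
```

Note that a variable with k categories expands into k − 1 indicator columns, which is precisely how coding systems inflate the number of predictors.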
These two approaches, however, have limitations. Coding systems can cause a significant increase in the number of predictors when there are many categorical variables or when a categorical variable has many categories. For this reason, coding-based approaches require additional steps to reduce the number of variables [23], [36]. Moreover, defining suitable similarity measures between categorical variables, or between categorical and numeric variables, is itself a complex challenge.
To address these issues, we propose a hybrid regression model constructed with a decision tree algorithm. In the proposed model, the categorical variables are used to train a decision tree, and the portion of the target that the tree leaves unexplained is then predicted from the numeric variables using another regression algorithm. We compared the performance of the proposed model with that of the conventional coding approach using five different regression algorithms.
The proposed model has properties in common with M5, a hybrid decision tree model that combines a decision tree with regression algorithms [27]. M5 is an improved regression tree algorithm, which splits the input space into subspaces and builds local linear regression models [27]. There are two differences between the proposed model and M5. First, the proposed model uses only categorical variables to split the input space. Second, a single regression model is trained on the numeric variables after tree growth, whereas M5 trains a linear regression model at every terminal node. These differences stem from the two models' different origins: the proposed model is devised to effectively capture the effect of categorical variables in a regression model, whereas M5 was developed to improve the performance of regression trees. Despite these differences, we compared the performance of the proposed model with that of the M5 model tree in terms of accuracy and computational time. This comparison is reasonable because the two algorithms resemble each other in that both predict a continuous target variable by combining a regression model with a decision tree.
The remainder of this paper is organized as follows. Section 2 reviews previous studies on the inclusion of categorical variables in regression analysis and on decision tree algorithms. Section 3 provides a detailed description of the proposed hybrid regression algorithm. Section 4 presents the experimental results. Finally, Section 5 concludes the paper.
Handling categorical variables
A categorical variable is a variable that takes one value from a finite set of possible values, each value corresponding to a particular group or category. Categorical variables are of two types: nominal and ordinal. An ordinal variable takes values that can be logically ordered or ranked, whereas the values of a nominal variable cannot be organized in any logical sequence. In both cases, the values are qualitative and non-numeric. It
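The nominal/ordinal distinction matters for how a variable may be encoded. A brief sketch of the two cases, with invented category names and an assumed ordering for the ordinal example:

```python
import pandas as pd

# Nominal variable: categories have no inherent order, so one-hot
# indicators are appropriate; any integer mapping would impose a fake order.
nominal = pd.Series(["red", "blue", "red"], name="color")
onehot = pd.get_dummies(nominal)

# Ordinal variable: categories can be ranked, so an integer mapping
# that preserves the order is meaningful.
order = {"small": 0, "medium": 1, "large": 2}
ordinal = pd.Series(["medium", "small", "large"], name="size").map(order)
print(list(ordinal))  # [1, 0, 2]
```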
The proposed algorithm
The primary purpose of the proposed algorithm is to provide a new hybrid algorithm that performs better on mixed data. The algorithm combines the individual strengths of a decision tree and other regression algorithms while mitigating the disadvantages of both. The primary concept was inspired by the observation that the contributions of categorical variables in linear regression models, including ridge and lasso with dummy coding, are always discrete and discontinuous [32],
Data description and preparation
The purpose of the proposed method is to effectively handle mixed data in real-world regression problems. We therefore selected 10 mixed datasets from Kaggle (http://www.kaggle.com), an online platform for predictive modeling and analytics competitions on which companies and researchers post their data and data miners compete to build the best models, one mixed dataset from the UCI repository [21], and one from the Korea Energy Statistical Information System. Preprocessing procedures
Conclusion
The primary objective of the proposed hybrid regression algorithm is to effectively process both categorical and numeric variables for inclusion in regression analysis. The primary concept of the proposed algorithm is to utilize a decision tree to estimate the effect of the categorical variables on the continuous target variable since a decision tree supports categorical variables as well as numeric variables and is inherently non-parametric. In the proposed algorithm, the decision tree
Acknowledgment
This study was supported by the Research Program funded by SeoulTech (Seoul National University of Science and Technology).
References (37)
- et al., Evolutionary model trees for handling continuous classes in machine learning, Inf. Sci. (2011)
- et al., Classification tree analysis using TARGET, Comput. Stat. Data Anal. (2008)
- et al., Combining non-parametric models with logistic regression: an application to motor vehicle injury data, Comput. Stat. Data Anal. (2000)
- et al., Evolving model trees for mining data sets with continuous-valued classes, Expert Syst. Appl. (2008)
- et al., Categorical data clustering: what similarity measure to recommend?, Expert Syst. Appl. (2015)
- et al., Combining decision tree and Naive Bayes for classification, Knowl. Based Syst. (2006)
- Model Comparisons and R2, Am. Stat. (1994)
- et al., A survey of evolutionary algorithms for decision-tree induction, IEEE Trans. Syst., Man, Cybern., Part C (Appl. Rev.) (2012)
- et al., Automatic Design of Decision-Tree Induction Algorithms (2015)
- et al., Selection of predictors in distance-based regression, Commun. Stat. Simul. Comput. (2007)