
Pattern Recognition Letters

Volume 98, 15 October 2017, Pages 39-45

A hybrid decision tree algorithm for mixed numeric and categorical data in regression analysis

https://doi.org/10.1016/j.patrec.2017.08.011

Highlights

  • We propose a simple but efficient hybrid regression algorithm using a decision tree.

  • The proposed algorithm significantly improves the accuracy of the combined regression algorithms.

  • The proposed algorithm can capture nonlinearity in categorical variables.

  • The proposed hybrid algorithm does not increase computation cost.

Abstract

In many real-world problems, the collected data are not purely numeric; they can also include categorical variables. The inclusion of different types of variables may complicate regression analysis. Many regression algorithms, such as linear regression, support vector regression, and neural networks, which train the parameters of a model to identify relations between input and output variables, can easily process numeric variables; categorical variables, however, require additional treatment. A decision tree algorithm, on the other hand, estimates a target based on specified rules and can therefore handle categorical variables as well as numeric ones. Exploiting this property, a new hybrid model combining a decision tree with another regression algorithm is proposed to analyze mixed data. In the proposed model, the portion of the target values explained by the categorical variables is estimated by the decision tree, and the remainder is predicted by any regression algorithm trained on the numeric variables. The proposed algorithm was evaluated on 12 datasets drawn from real decision problems, and it achieved accuracy better than or comparable to that of the comparison methods, including the M5 model tree and the evolutionary tree. In addition, the new hybrid method does not significantly increase computational complexity even though it builds two separate models, an advantage over the M5 model tree and the evolutionary tree.

Introduction

One of the primary tasks in data mining is regression analysis, whose objective is to identify the relationships between a dependent variable and one or more independent variables; it has been used widely for prediction and forecasting in many real-world problems. In many cases, datasets consist of both numeric and categorical independent variables. Many regression algorithms, such as linear regression, support vector regression, and neural networks, are well defined and validated for numeric variables, because it is easy to model the relations between a target and its predictors when both are numeric. In contrast, categorical variables describe non-numeric, qualitative attributes of the data, and numeric operations cannot be performed on them.

One way to convert categorical data to numeric data is to use coding systems such as dummy coding, effects coding, and contrast coding to manage categorical predictors [8], [16]. Through these coding systems, qualitative variables representing categories or group memberships are converted into quantitative variables. Another approach is to define similarity or dissimilarity measures between categorical and numeric variables. Distance-based regression algorithms have utilized this approach by defining appropriate distance metrics [5], [10], [11]. Once similarity or dissimilarity functions are defined on both categorical and numeric variables, the original input data matrix can be transformed into a distance matrix, and any distance-based regression algorithm can be applied to the converted matrix. In addition, distance metrics are used to extract neighboring samples for k-nearest neighbor regression and non-parametric kernel regression.
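As an illustration of dummy coding, the sketch below maps a categorical column with k categories to k-1 binary indicator columns, with the dropped category acting as the reference level. The function and variable names are hypothetical, not from the paper:

```python
def dummy_code(values, drop_first=True):
    """Convert a categorical column into k-1 (or k) binary indicator columns."""
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]  # the dropped first level is the reference
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["red", "blue", "green", "blue"]
# Reference level is "blue"; the indicator columns are [green, red].
print(dummy_code(colors))  # [[0, 1], [0, 0], [1, 0], [0, 0]]
```

This also illustrates the limitation noted below: each categorical variable with k categories contributes k-1 new predictors, so the predictor count grows quickly with many categories.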

These two approaches, however, have limitations. Coding systems can significantly increase the number of predictors when there are many categorical variables or many categories per variable; for this reason, coding systems require additional steps to reduce the number of variables [23], [36]. Additionally, characterizing suitable similarity measures between categorical variables, or between categorical and numeric variables, remains a complex challenge.

To address this complexity, we propose a hybrid regression model constructed using a decision tree algorithm. In the proposed model, the categorical variables are used to train a decision tree, and the portions not explained by the decision tree are then predicted from the numeric variables using another regression algorithm. We compared the performance of the proposed model with that of the conventional coding approach across five different regression algorithms.
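The two-stage idea above can be sketched in a minimal form. As a stand-in for the decision tree stage, this sketch uses per-category means (what a tree grown on a single categorical variable would produce at its terminal nodes), and ordinary least squares on one numeric variable for the second stage; all names and the toy data are hypothetical:

```python
from collections import defaultdict

def fit_hybrid(cats, nums, y):
    """Two-stage hybrid: categorical effect via group means (a stand-in for
    the decision tree), then least squares on the numeric residuals."""
    # Stage 1: mean target per category (terminal-node estimate).
    groups = defaultdict(list)
    for c, t in zip(cats, y):
        groups[c].append(t)
    cat_effect = {c: sum(v) / len(v) for c, v in groups.items()}
    overall = sum(y) / len(y)  # fallback for unseen categories

    # Stage 2: simple linear regression on the residuals r = y - tree(c).
    r = [t - cat_effect[c] for c, t in zip(cats, y)]
    mx = sum(nums) / len(nums)
    mr = sum(r) / len(r)
    sxx = sum((x - mx) ** 2 for x in nums)
    slope = sum((x - mx) * (e - mr) for x, e in zip(nums, r)) / sxx
    intercept = mr - slope * mx

    def predict(c, x):
        return cat_effect.get(c, overall) + intercept + slope * x
    return predict

# Toy data generated as y = 10 * (cat == "b") + 2 * x.
model = fit_hybrid(["a", "a", "b", "b"], [1.0, 2.0, 1.0, 2.0],
                   [2.0, 4.0, 12.0, 14.0])
print(model("a", 3.0))  # 6.0
```

The final prediction is the sum of the two stages, which is the essence of the proposed decomposition; the paper's actual method grows a full decision tree over all categorical variables and allows any regression algorithm in the second stage.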

The proposed model has properties in common with M5, a hybrid decision tree model that combines a decision tree with regression [27]. M5 is an improved regression tree algorithm that splits the input space into subspaces and builds local linear regression models [27]. There are two differences between the proposed model and M5. First, the proposed model uses only categorical variables to split the input space. Second, a single regression model is trained on the numeric variables after the tree is grown, whereas M5 trains a linear regression model at every terminal node. These differences stem from the two algorithms' different origins: the proposed model is devised to effectively manage the impact of categorical variables in a regression model, whereas M5 was developed to improve the performance of regression trees. Despite these differences, we compared the proposed model with the M5 model tree in terms of accuracy and computation time, which is reasonable because both algorithms predict a continuous target variable by combining a regression model with a decision tree.
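The second difference can be made concrete with a schematic sketch (all leaf names and coefficients here are hypothetical): M5 keeps one local linear model per terminal node, while the proposed model keeps only a constant per node plus one global regression model.

```python
# M5-style: one local linear model per leaf of the tree.
m5_leaves = {
    "leaf_A": {"slope": 1.0, "intercept": 0.0},
    "leaf_B": {"slope": 3.0, "intercept": -2.0},
}

def m5_predict(leaf, x):
    m = m5_leaves[leaf]
    return m["intercept"] + m["slope"] * x

# Proposed: per-leaf constants (the categorical effect) plus ONE global model.
leaf_effect = {"leaf_A": 0.5, "leaf_B": 4.0}
global_model = {"slope": 2.0, "intercept": 1.0}

def hybrid_predict(leaf, x):
    return leaf_effect[leaf] + global_model["intercept"] + global_model["slope"] * x

print(m5_predict("leaf_B", 2.0))      # 4.0
print(hybrid_predict("leaf_A", 2.0))  # 5.5
```

With L leaves and p numeric predictors, M5 stores on the order of L separate models, while the hybrid stores L constants and a single model, which is consistent with the paper's claim that the hybrid does not significantly increase computational cost.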

The remainder of this paper is organized as follows. In Section 2, previous studies of the inclusion of categorical variables in regression analysis and decision tree algorithms are reviewed. Section 3 provides a detailed description of the proposed hybrid regression algorithm. The experimental results are presented in Section 4. Finally, the conclusion of the paper is summarized in Section 5.

Section snippets

Handling categorical variables

A categorical variable is a variable that can take one value from a finite set of possible values, each of which corresponds to a particular group or category. Categorical variables are of two types: nominal and ordinal. An ordinal variable takes values that can be logically ordered or ranked, whereas the values of a nominal variable cannot be arranged in a logical sequence. In both cases, the values are qualitative and non-numeric. It
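The nominal/ordinal distinction can be made concrete with a small sketch (the category sets here are hypothetical examples, not from the paper): an ordinal variable admits a meaningful integer mapping, while a nominal one does not.

```python
# Ordinal: the categories carry an order, so an integer mapping preserves it.
sizes = ["small", "medium", "large"]
ordinal_map = {c: i for i, c in enumerate(sizes)}  # small < medium < large
assert ordinal_map["small"] < ordinal_map["large"]

# Nominal: no order exists; any integer mapping would impose a spurious one,
# so indicator (dummy) columns are used instead.
colors = ["red", "green", "blue"]
indicator = {c: [int(c == k) for k in colors] for c in colors}
print(indicator["green"])  # [0, 1, 0]
```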

The proposed algorithm

The primary purpose of the proposed algorithm is to provide a new hybrid algorithm that performs well on mixed data. The algorithm combines the individual strengths of a decision tree and another regression algorithm while mitigating the disadvantages of each. The primary concept was inspired by the observation that the effects of dummy-coded categorical variables in linear regression models, including ridge and lasso, are always discrete and discontinuous [32],

Data description and preparation

The purpose of the proposed method is to effectively handle mixed data in real-world regression problems. Therefore, we selected 10 different mixed datasets from Kaggle (http://www.kaggle.com), an online platform for predictive modeling and analytics competitions on which companies and researchers post their data and data miners compete to build the best models, one mixed dataset from the UCI repository [21], and one from the Korea Energy Statistical Information System. Preprocessing procedures

Conclusion

The primary objective of the proposed hybrid regression algorithm is to effectively process both categorical and numeric variables for inclusion in regression analysis. The primary concept of the proposed algorithm is to utilize a decision tree to estimate the effect of the categorical variables on the continuous target variable since a decision tree supports categorical variables as well as numeric variables and is inherently non-parametric. In the proposed algorithm, the decision tree

Acknowledgment

This study was supported by the Research Program funded by SeoulTech (Seoul National University of Science and Technology).

References (37)

  • S. Boriah et al.

Similarity measures for categorical data: a comparative evaluation

Proceedings of the SIAM International Conference on Data Mining, SDM 2008, April 24–26, 2008, Atlanta, Georgia, USA

    (2008)
  • L. Breiman et al.

    Classification and Regression Trees

    (1984)
  • J. Cohen et al.

    Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences

    (2013)
  • C.M. Cuadras et al.

Some computational aspects of a distance-based model for prediction

Commun. Stat. Simul. Comput.

    (1996)
  • C.M. Cuadras et al.

A distance-based regression model for prediction with mixed data

Commun. Stat. Theory Methods

    (1990)
  • G. Fan et al.

    Regression tree analysis using TARGET

    J. Comput. Graphical Stat.

    (2005)
  • J.C. Gower

    A general coefficient of similarity and some of its properties

    Biometrics

    (1971)