
Classification and regression using augmented trees


Abstract

In this work, we present an algorithm for regression and classification tasks on big datasets using augmented tree models. Partitioning a big dataset with a tree model permits a divide-and-conquer strategy for classification and regression. Experiments conducted as part of this study show that such an approach has an important benefit: methods that typically achieve high accuracy on big datasets, such as ensemble tree methods or neural networks, produce models that are not interpretable, whereas the models produced by the proposed algorithm are interpretable while being as accurate as ensembles such as random forests or gradient-boosted trees. Model interpretation can be performed at coarse and fine granularity, which permits us to extract insights that characterize either the entire dataset or a particular subset of the data. Models that are both accurate and interpretable are highly desirable in many application settings. The partitions created by the algorithm also permit a divide-and-conquer approach to model analysis; analyzing performance by partition helped identify problems such as possible data errors and model overfitting.
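The following is a minimal sketch of the partition-then-fit idea described above; it is not the authors' implementation. It assumes scikit-learn and NumPy are available, and the choice of a shallow CART tree for partitioning with a ridge regression inside each leaf is an illustrative assumption, standing in for whatever leaf-level learner the algorithm actually uses.

```python
# Minimal sketch of the partition-then-fit idea (illustrative, not the
# authors' implementation). Assumes scikit-learn; the shallow CART tree
# and per-leaf ridge regression are stand-in choices.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge

def fit_augmented_tree(X, y, max_leaf_nodes=8):
    # Step 1: partition the data with a small regression tree.
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, y)
    leaf_ids = tree.apply(X)                      # leaf index for each row
    # Step 2: fit an interpretable model inside each partition.
    leaf_models = {leaf: Ridge().fit(X[leaf_ids == leaf], y[leaf_ids == leaf])
                   for leaf in np.unique(leaf_ids)}
    return tree, leaf_models

def predict_augmented_tree(tree, leaf_models, X):
    # Route each row to its partition and apply that partition's model.
    leaf_ids = tree.apply(X)
    y_hat = np.empty(X.shape[0])
    for leaf, model in leaf_models.items():
        mask = leaf_ids == leaf
        if mask.any():
            y_hat[mask] = model.predict(X[mask])
    return y_hat
```

In this sketch, the tree's split rules give the coarse-grained view of the data, while the coefficients of each leaf model can be inspected for fine-grained, partition-level interpretation, mirroring the two levels of interpretation described above.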


Notes

  1. Feature names are encoded for readability; using full feature names clutters the visualization. Encoded names have the prefix ‘TF’ followed by an integer indicating the column number in the dataset.

References

  1. Bach, F.R., Lanckriet, G.R., Jordan, M.I.: Multiple kernel learning, conic duality, and the SMO algorithm. In: Proceedings of the Twenty-First International Conference on Machine Learning, p. 6. ACM (2004)

  2. Boser, B., Guyon, I., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: 5th Annual ACM Workshop on COLT, pp. 144–152. ACM Press, Pittsburgh (1992)

  3. Bousquet, O., Boucheron, S., Lugosi, G.: Introduction to statistical learning theory. In: Bousquet, O., von Luxburg, U., Rätsch, G. (eds.) Advanced Lectures on Machine Learning, pp. 169–207. Springer (2004)

  4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)

  5. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC Press, Boca Raton (1984)

  6. Briand, B., Ducharme, G.R., Parache, V., Mercat-Rommens, C.: A similarity measure to assess the stability of classification trees. Comput. Stat. Data Anal. 53(4), 1208–1217 (2009)

  7. Brochu, E., Cora, V.M., De Freitas, N.: A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010)

  8. Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M., Elhadad, N.: Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM (2015)

  9. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM (2016)

  10. Chipman, H., McCulloch, R.E.: Hierarchical priors for Bayesian CART shrinkage. Stat. Comput. 10(1), 17–24 (2000)

  11. Chipman, H.A., George, E.I., McCulloch, R.E.: Bayesian treed models. Mach. Learn. 48(1), 299–320 (2002)

  12. Das, S.: Generalized Linear Models and Beyond: An Innovative Approach from Bayesian Perspective. University of Connecticut, Storrs (2008)

  13. Das, S., Dey, D.K.: On Bayesian analysis of generalized linear models using the Jacobian technique. Am. Stat. 60(3), 264–268 (2006)

  14. Das, S., Dey, D.K.: On Bayesian inference for generalized multivariate gamma distribution. Stat. Probab. Lett. 80(19), 1492–1499 (2010)

  15. Das, S., Dey, D.K.: On dynamic generalized linear models with applications. Methodol. Comput. Appl. Probab. 15, 407–421 (2013)

  16. Das, S., Harel, O., Dey, D.K., Covault, J., Kranzler, H.R.: Analysis of extreme drinking in patients with alcohol dependence using Pareto regression. Stat. Med. 29(11), 1250–1258 (2010)

  17. Das, S., Roy, S., Sambasivan, R.: Fast Gaussian process regression for big data. Big Data Research (2018). https://doi.org/10.1016/j.bdr.2018.06.002. http://www.sciencedirect.com/science/article/pii/S2214579617301909

  18. Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics, vol. 1. Springer, New York (2001)

  19. Gelman, A., Hill, J.: Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, Cambridge (2006)

  20. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)

  21. Gini, C.: Variability and mutability, contribution to the study of statistical distributions and relations. Studi Economico-Giuridici della R. Università di Cagliari (1912). Reviewed in: Light, R.J., Margolin, B.H.: An analysis of variance for categorical data. J. Am. Stat. Assoc. 66, 534–544 (1971)

  22. Gramacy, R.B., Lee, H.K.H.: Bayesian treed Gaussian process models with an application to computer modeling. J. Am. Stat. Assoc. 103(483), 1119–1130 (2008)

  23. Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58(301), 13–30 (1963). http://www.jstor.org/stable/2282952

  24. Kaggle: The home of data science and machine learning. http://www.kaggle.com/

  25. Kaggle: The playground (2016). http://blog.kaggle.com/2013/09/25/the-playground/

  26. Karalič, A.: Employing linear regression in regression tree leaves. In: Proceedings of the 10th European Conference on Artificial Intelligence, pp. 440–441. Wiley (1992)

  27. Kohavi, R.: Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree hybrid. In: KDD, vol. 96, pp. 202–207 (1996)

  28. Lee, Y., Nelder, J.A.: Hierarchical generalized linear models. J. R. Stat. Soc. Ser. B (Methodol.) 58, 619–678 (1996)

  29. Lichman, M.: UCI machine learning repository (2016). http://archive.ics.uci.edu/ml

  30. Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation forest. In: Eighth IEEE International Conference on Data Mining, 2008. ICDM’08, pp. 413–422. IEEE (2008)

  31. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer, New York (1996)

  32. Nowak, R.: Introduction to classification and regression (2009). http://nowak.ece.wisc.edu/SLT09/lecture2.pdf

  33. Quinlan, J.R.: C4.5: Programs for Machine Learning. Elsevier, Amsterdam (2014)

  34. Sambasivan, R., Das, S.: Big data regression using tree based segmentation. In: Proceedings of the 14th IEEE India Council International Conference. IEEE (2017)

  35. Shmueli, G., et al.: To explain or to predict? Stat. Sci. 25(3), 289–310 (2010)

  36. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2. NIPS’12, Lake Tahoe, Nevada, pp. 2951–2959. Curran Associates Inc., USA (2012)

  37. Torgo, L.: Large regression datasets (2016). http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

  38. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 58, 267–288 (1996)

  39. Torgo, L.: Functional models for regression tree leaves. ICML 97, 385–393 (1997)

  40. Tumer, K., Ghosh, J.: Bayes error rate estimation using classifier ensembles. Int. J. Smart Eng. Syst. Des. 5(2), 95–109 (2003)

  41. USDOT, B.: RITA airline delay data download (2016). http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

  42. Waterhouse, S., MacKay, D., Robinson, T.: Bayesian methods for mixtures of experts. In: Advances in Neural Information Processing Systems, pp. 351–357. MIT Press (1996)

  43. Xiong, H., Pandey, G., Steinbach, M., Kumar, V.: Enhancing data analysis with noise removal. IEEE Trans. Knowl. Data Eng. 18(3), 304–319 (2006)

  44. Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-First International Conference on Machine Learning, p. 116. ACM (2004)

  45. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67(2), 301–320 (2005)


Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.

Author information

Corresponding author

Correspondence to Rajiv Sambasivan.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Appendix

1.1 Optimality of Bayes classifier

Consider a binary classification problem (adapted from [31] and [32]). A classifier is associated with a decision function \(f(X): X \mapsto \{0,1\}\), where \(X \in \mathbb {R}^d\) denotes the input space and \(Y \in \{ 0, 1\}\) the output space.

Definition 1

The Bayes classifier \(f^*(X)\) is the following mapping:

$$\begin{aligned} f^*(X) = {\left\{ \begin{array}{ll} 1 &{} \quad \eta (X) \ge \frac{1}{2}, \\ 0 &{} \quad \text {otherwise}. \end{array}\right. } \end{aligned}$$
(3)

Here \(\eta (x) = \mathbf {P}[Y = 1 | X = x]\).
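As a side note not in the original text, Definition 1 yields a standard identity for the conditional error of the Bayes classifier, which is useful when reading the proof below: at each point, the Bayes rule errs with the smaller of the two posterior probabilities,

$$\begin{aligned} \mathbf {P}[f^*(X) \ne Y \mid X = x] = \min \{\eta (x),\, 1 - \eta (x)\}. \end{aligned}$$

For instance, if \(\eta (x_0) = 0.7\) at some point \(x_0\), then \(f^*(x_0) = 1\) and the conditional error there is 0.3, whereas any rule predicting 0 at \(x_0\) errs with probability 0.7.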

Theorem 1

The Bayes classifier \(f^*(X): X \mapsto \{0,1\}\) minimizes the probability of misclassification; i.e., for any classifier \(f\),

$$\begin{aligned} \mathbf {P}[f^*(X) \ne Y] \le \mathbf {P}[f(X) \ne Y] \end{aligned}$$
(4)

Proof

$$\begin{aligned} \mathbf {P}[f(X) \ne Y | X = x]&= 1 - \mathbf {P}[Y = f(X)| X = x]\nonumber \\&= 1 - (\mathbf {P}[Y = 1, f(X) = 1 | X = x] \nonumber \\&\quad +\mathbf {P}[Y = 0, f(X) = 0 | X = x])\nonumber \\&= 1 - (\mathbf {I}_{(f(X) = 1)} \mathbf {P}[Y = 1 | X = x] \nonumber \\&\quad + \mathbf {I}_{(f(X) = 0)} \mathbf {P}[Y = 0 | X = x])\nonumber \\&= 1 - (\mathbf {I}_{(f(X) = 1)} \eta (x) \nonumber \\&\quad + \mathbf {I}_{(f(X) = 0)}(1 -\eta (x))). \end{aligned}$$
(5)

Here \(\mathbf {I}_A\) is the indicator function associated with A. Along similar lines, we can express the probability of misclassification associated with the Bayes classifier (\(f^*\)) as:

$$\begin{aligned} \mathbf {P}[f^*(X) \ne Y | X = x]&= 1 - (\mathbf {I}_{(f^*(X) = 1)} \eta (x) \nonumber \\&\quad + \mathbf {I}_{(f^*(X) = 0)}(1 -\eta (x))). \end{aligned}$$
(6)

Subtracting Eq. (6) from Eq. (5) and using the relations:

$$\begin{aligned} \mathbf {I}_{(f(X) = 0)} = (1 - \mathbf {I}_{(f(X) = 1)}) \end{aligned}$$

and

$$\begin{aligned} \mathbf {I}_{(f^*(X) = 0)} = (1 - \mathbf {I}_{(f^*(X) = 1)}), \end{aligned}$$

we have:

$$\begin{aligned}&\mathbf {P}[f(X) \ne Y | X = x] - \mathbf {P}[f^*(X) \ne Y | X = x] \nonumber \\&\quad = \mathbf {I}_{(f^*(X) = 1)}\big (\eta (x) - (1-\eta (x))\big ) - \mathbf {I}_{(f(X) = 1)}\big (\eta (x) - (1-\eta (x))\big ) \nonumber \\&\quad = (2\eta (x) - 1) \big ( \mathbf {I}_{(f^*(X) = 1)} - \mathbf {I}_{(f(X) = 1)}\big ) \end{aligned}$$
(7)

Let us examine the right-hand side of Eq. (7):

$$\begin{aligned} (2\eta (x) - 1) \big ( \mathbf {I}_{(f^*(X) = 1)} - \mathbf {I}_{(f(X) = 1)}\big ). \end{aligned}$$

\(\eta (X)\) takes values in [0, 1], so there are two cases to consider when evaluating the right-hand side of Eq. (7):

  1. \(0\le \eta (X) < \frac{1}{2}\),

  2. \(\frac{1}{2} \le \eta (X) \le 1 \).

When we have \(0\le \eta (X) < \frac{1}{2}\):

  1. \(2\eta (x) - 1 < 0 \),

  2. applying Eq. (3), \( \mathbf {I}_{(f^*(X) = 1)} - \mathbf {I}_{(f(X) = 1)} \le 0 \).

Therefore, \((2\eta (x) - 1) \big ( \mathbf {I}_{(f^*(X) = 1)} - \mathbf {I}_{(f(X) = 1)}\big ) \ge 0 \).

When we have \(\frac{1}{2} \le \eta (X) \le 1 \):

  1. \(2\eta (x) - 1 \ge 0 \),

  2. applying Eq. (3), \( \mathbf {I}_{(f^*(X) = 1)} - \mathbf {I}_{(f(X) = 1)} \ge 0 \).

Therefore, \((2\eta (x) - 1) \big ( \mathbf {I}_{(f^*(X) = 1)} - \mathbf {I}_{(f(X) = 1)}\big ) \ge 0 \). In other words, the difference between the probability of misclassification of any classifier \(f\) and that of the Bayes classifier \(f^*\) is always nonnegative:

$$\begin{aligned} \mathbf {P}[f(X) \ne Y | X = x] - \mathbf {P}[f^*(X) \ne Y | X = x] \ge 0. \end{aligned}$$

Since this inequality holds for every \(x\), taking the expectation over \(X\) yields Eq. (4). This implies that the Bayes classifier, \(f^*\), is the optimal classifier. \(\square \)
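As a quick numerical illustration of Theorem 1 (not part of the original article), the sketch below simulates data from a known \(\eta(x)\) and compares the empirical error of the Bayes rule with that of an arbitrary competing rule. The logistic form of \(\eta\) and the competing threshold are illustrative assumptions.

```python
# Empirical check of Bayes-classifier optimality on synthetic data.
# eta(x) = P[Y = 1 | X = x] is chosen as a logistic function (an
# illustrative assumption); any competing rule should do no better.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(-3, 3, size=n)
eta = 1.0 / (1.0 + np.exp(-2.0 * x))      # known eta(x)
y = rng.binomial(1, eta)                  # Y | X = x ~ Bernoulli(eta(x))

bayes_pred = (eta >= 0.5).astype(int)     # f*(x) = 1{eta(x) >= 1/2}
other_pred = (x >= 1.0).astype(int)       # an arbitrary competing rule

print("Bayes rule error:     ", np.mean(bayes_pred != y))
print("Competing rule error: ", np.mean(other_pred != y))
# Expected: the Bayes rule's error is no larger than the competitor's,
# consistent with Eq. (4).
```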


About this article


Cite this article

Sambasivan, R., Das, S. Classification and regression using augmented trees. Int J Data Sci Anal 7, 259–276 (2019). https://doi.org/10.1007/s41060-018-0146-6
