
Classification and regression using augmented trees


Abstract

In this work, we present an algorithm for regression and classification tasks on big datasets using augmented tree models. Partitioning a big dataset with a tree model permits a divide-and-conquer strategy for classification and regression. Experiments conducted as part of this study show that such an approach has an important benefit: methods that typically achieve high accuracy on big datasets, such as ensemble tree methods or neural networks, produce models that are not interpretable, whereas the models produced by the proposed algorithm are interpretable while being as accurate as ensembles such as random forests or gradient-boosted trees. Model interpretation can be performed at coarse and fine granularity, which permits us to extract insights that characterize either the entire dataset or a particular subset of the data. Models that are both accurate and interpretable are highly desirable in many application settings. The partitions created by the algorithm also permit a divide-and-conquer approach to model analysis; analyzing performance by partition helped identify problems such as possible data errors and model overfitting.
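The following is a minimal sketch of the partition-then-fit idea described above; it is not the authors' implementation. It assumes scikit-learn and NumPy are available, and the choice of a shallow CART tree for partitioning with a ridge regression inside each leaf is an illustrative assumption, standing in for whatever leaf-level learner the algorithm actually uses.

```python
# Minimal sketch of the partition-then-fit idea (illustrative, not the
# authors' implementation). Assumes scikit-learn; the shallow CART tree
# and per-leaf ridge regression are stand-in choices.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge

def fit_augmented_tree(X, y, max_leaf_nodes=8):
    # Step 1: partition the data with a small regression tree.
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, y)
    leaf_ids = tree.apply(X)                      # leaf index for each row
    # Step 2: fit an interpretable model inside each partition.
    leaf_models = {leaf: Ridge().fit(X[leaf_ids == leaf], y[leaf_ids == leaf])
                   for leaf in np.unique(leaf_ids)}
    return tree, leaf_models

def predict_augmented_tree(tree, leaf_models, X):
    # Route each row to its partition and apply that partition's model.
    leaf_ids = tree.apply(X)
    y_hat = np.empty(X.shape[0])
    for leaf, model in leaf_models.items():
        mask = leaf_ids == leaf
        if mask.any():
            y_hat[mask] = model.predict(X[mask])
    return y_hat
```

In this sketch, the tree's split rules give the coarse-grained view of the data, while the coefficients of each leaf model can be inspected for fine-grained, partition-level interpretation, mirroring the two levels of interpretation described above.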


Notes

  1. Feature names are encoded for readability; using full feature names clutters the visualization. Encoded names have the prefix ‘TF’ followed by an integer indicating the column number in the dataset.

References

  1. Bach, F.R., Lanckriet, G.R., Jordan, M.I.: Multiple kernel learning, conic duality, and the SMO algorithm. In: Proceedings of the Twenty-First International Conference on Machine Learning, p. 6. ACM (2004)

  2. Boser, B., Guyon, I., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: 5th Annual ACM Workshop on COLT, pp. 144–152. ACM Press, Pittsburgh (1992)

  3. Bousquet, O., Boucheron, S., Lugosi, G.: Introduction to statistical learning theory. In: Bousquet, O., von Luxburg, U., Rätsch, G. (eds.) Advanced Lectures on Machine Learning, pp. 169–207. Springer (2004)

  4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)

  5. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC Press, Boca Raton (1984)

  6. Briand, B., Ducharme, G.R., Parache, V., Mercat-Rommens, C.: A similarity measure to assess the stability of classification trees. Comput. Stat. Data Anal. 53(4), 1208–1217 (2009)

  7. Brochu, E., Cora, V.M., De Freitas, N.: A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010)

  8. Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M., Elhadad, N.: Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM (2015)

  9. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM (2016)

  10. Chipman, H., McCulloch, R.E.: Hierarchical priors for Bayesian CART shrinkage. Stat. Comput. 10(1), 17–24 (2000)

  11. Chipman, H.A., George, E.I., McCulloch, R.E.: Bayesian treed models. Mach. Learn. 48(1), 299–320 (2002)

  12. Das, S.: Generalized Linear Models and Beyond: An Innovative Approach from Bayesian Perspective. University of Connecticut, Storrs (2008)

  13. Das, S., Dey, D.K.: On Bayesian analysis of generalized linear models using the Jacobian technique. Am. Stat. 60(3), 264–268 (2006)

  14. Das, S., Dey, D.K.: On Bayesian inference for generalized multivariate gamma distribution. Stat. Probab. Lett. 80(19), 1492–1499 (2010)

  15. Das, S., Dey, D.K.: On dynamic generalized linear models with applications. Methodol. Comput. Appl. Probab. 15, 407–421 (2013)

  16. Das, S., Harel, O., Dey, D.K., Covault, J., Kranzler, H.R.: Analysis of extreme drinking in patients with alcohol dependence using Pareto regression. Stat. Med. 29(11), 1250–1258 (2010)

  17. Das, S., Roy, S., Sambasivan, R.: Fast Gaussian process regression for big data. Big Data Research (2018). https://doi.org/10.1016/j.bdr.2018.06.002. http://www.sciencedirect.com/science/article/pii/S2214579617301909

  18. Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics, vol. 1. Springer, New York (2001)

  19. Gelman, A., Hill, J.: Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, Cambridge (2006)

  20. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)

  21. Gini, C.: Variability and mutability, contribution to the study of statistical distributions and relations. Studi Economico-Giuridici della R. Università di Cagliari (1912). Reviewed in: Light, R.J., Margolin, B.H.: An analysis of variance for categorical data. J. Am. Stat. Assoc. 66, 534–544 (1971)

  22. Gramacy, R.B., Lee, H.K.H.: Bayesian treed Gaussian process models with an application to computer modeling. J. Am. Stat. Assoc. 103(483), 1119–1130 (2008)

  23. Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58(301), 13–30 (1963). http://www.jstor.org/stable/2282952

  24. Kaggle: The home of data science and machine learning. http://www.kaggle.com/

  25. Kaggle: The playground (2016). http://blog.kaggle.com/2013/09/25/the-playground/

  26. Karalič, A.: Employing linear regression in regression tree leaves. In: Proceedings of the 10th European Conference on Artificial Intelligence, pp. 440–441. Wiley (1992)

  27. Kohavi, R.: Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree hybrid. In: KDD, vol. 96, pp. 202–207 (1996)

  28. Lee, Y., Nelder, J.A.: Hierarchical generalized linear models. J. R. Stat. Soc. Ser. B (Methodol.) 58, 619–678 (1996)

  29. Lichman, M.: UCI machine learning repository (2016). http://archive.ics.uci.edu/ml

  30. Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation forest. In: Eighth IEEE International Conference on Data Mining, 2008. ICDM’08, pp. 413–422. IEEE (2008)

  31. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer, New York (1996)

  32. Nowak, R.: Introduction to classification and regression (2009). http://nowak.ece.wisc.edu/SLT09/lecture2.pdf

  33. Quinlan, J.R.: C4.5: Programs for Machine Learning. Elsevier, Amsterdam (2014)

  34. Sambasivan, R., Das, S.: Big data regression using tree based segmentation. In: Proceedings of the 14th IEEE India Council International Conference. IEEE (2017)

  35. Shmueli, G., et al.: To explain or to predict? Stat. Sci. 25(3), 289–310 (2010)

  36. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2. NIPS’12, Lake Tahoe, Nevada, pp. 2951–2959. Curran Associates Inc., USA (2012)

  37. Torgo, L.: Large regression datasets (2016). http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

  38. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 58, 267–288 (1996)

  39. Torgo, L.: Functional models for regression tree leaves. ICML 97, 385–393 (1997)

  40. Tumer, K., Ghosh, J.: Bayes error rate estimation using classifier ensembles. Int. J. Smart Eng. Syst. Des. 5(2), 95–109 (2003)

  41. USDOT, B.: RITA airline delay data download (2016). http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

  42. Waterhouse, S., MacKay, D., Robinson, T.: Bayesian methods for mixtures of experts. In: Advances in Neural Information Processing Systems, pp. 351–357. MIT Press (1996)

  43. Xiong, H., Pandey, G., Steinbach, M., Kumar, V.: Enhancing data analysis with noise removal. IEEE Trans. Knowl. Data Eng. 18(3), 304–319 (2006)

  44. Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-First International Conference on Machine Learning, p. 116. ACM (2004)

  45. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67(2), 301–320 (2005)


Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.

Author information

Corresponding author

Correspondence to Rajiv Sambasivan.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Appendix

1.1 Optimality of Bayes classifier

Consider a binary classification problem (adapted from [31] and [32]). A classifier is associated with a decision function \(f(X): X \mapsto \{0,1\}\), where \(X \in \mathbb {R}^d\) denotes the input space and \(Y \in \{ 0, 1\}\) the output space.

Definition 1

The Bayes classifier \(f^*(X)\) is the following mapping:

$$\begin{aligned} f^*(X) = {\left\{ \begin{array}{ll} 1 &{} \quad \eta (X) \ge \frac{1}{2}, \\ 0 &{} \quad \text {otherwise}. \end{array}\right. } \end{aligned}$$
(3)

Here \(\eta (x) = \mathbf {P}[Y = 1 | X = x]\).
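As a side note not in the original text, Definition 1 yields a standard identity for the conditional error of the Bayes classifier, which is useful when reading the proof below: at each point, the Bayes rule errs with the smaller of the two posterior probabilities,

$$\begin{aligned} \mathbf {P}[f^*(X) \ne Y \mid X = x] = \min \{\eta (x),\, 1 - \eta (x)\}. \end{aligned}$$

For instance, if \(\eta (x_0) = 0.7\) at some point \(x_0\), then \(f^*(x_0) = 1\) and the conditional error there is 0.3, whereas any rule predicting 0 at \(x_0\) errs with probability 0.7.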

Theorem 1

The Bayes classifier \(f^*(X): X \mapsto \{0,1\}\) minimizes the probability of misclassification; i.e., for any classifier \(f\),

$$\begin{aligned} \mathbf {P}[f^*(X) \ne Y] \le \mathbf {P}[f(X) \ne Y] \end{aligned}$$
(4)

Proof

$$\begin{aligned} \mathbf {P}[f(X) \ne Y | X = x]&= 1 - \mathbf {P}[Y = f(X)| X = x]\nonumber \\&= 1 - (\mathbf {P}[Y = 1, f(X) = 1 | X = x] \nonumber \\&\quad +\mathbf {P}[Y = 0, f(X) = 0 | X = x])\nonumber \\&= 1 - (\mathbf {I}_{(f(X) = 1)} \mathbf {P}[Y = 1 | X = x] \nonumber \\&\quad + \mathbf {I}_{(f(X) = 0)} \mathbf {P}[Y = 0 | X = x])\nonumber \\&= 1 - (\mathbf {I}_{(f(X) = 1)} \eta (x) \nonumber \\&\quad + \mathbf {I}_{(f(X) = 0)}(1 -\eta (x))). \end{aligned}$$
(5)

Here \(\mathbf {I}_A\) is the indicator function associated with A. Along similar lines, we can express the probability of misclassification associated with the Bayes classifier (\(f^*\)) as:

$$\begin{aligned} \mathbf {P}[f^*(X) \ne Y | X = x]&= 1 - (\mathbf {I}_{(f^*(X) = 1)} \eta (x) \nonumber \\&\quad + \mathbf {I}_{(f^*(X) = 0)}(1 -\eta (x))). \end{aligned}$$
(6)

Subtracting Eq. (6) from Eq. (5) and using the relations:

$$\begin{aligned} \mathbf {I}_{(f(X) = 0)} = (1 - \mathbf {I}_{(f(X) = 1)}) \end{aligned}$$

and

$$\begin{aligned} \mathbf {I}_{(f^*(X) = 0)} = (1 - \mathbf {I}_{(f^*(X) = 1)}), \end{aligned}$$

we have:

$$\begin{aligned}&\mathbf {P}[f(X) \ne Y | X = x] - \mathbf {P}[f^*(X) \ne Y | X = x] \nonumber \\&\quad = \mathbf {I}_{(f^*(X) = 1)}\big (\eta (x) - (1-\eta (x))\big ) - \mathbf {I}_{(f(X) = 1)}\big (\eta (x) - (1-\eta (x))\big ) \nonumber \\&\quad = (2\eta (x) - 1) \big ( \mathbf {I}_{(f^*(X) = 1)} - \mathbf {I}_{(f(X) = 1)}\big ) \end{aligned}$$
(7)

Let us examine the right-hand side of Eq. (7):

$$\begin{aligned} (2\eta (x) - 1) \big ( \mathbf {I}_{(f^*(X) = 1)} - \mathbf {I}_{(f(X) = 1)}\big ). \end{aligned}$$

\(\eta (X)\) takes values in [0, 1], so there are two cases to consider when evaluating the right-hand side of Eq. (7):

  1. \(0\le \eta (X) < \frac{1}{2}\),

  2. \(\frac{1}{2} \le \eta (X) \le 1 \).

When we have \(0\le \eta (X) < \frac{1}{2}\):

  1. \(2\eta (x) - 1 < 0 \),

  2. applying Eq. (3), \( \mathbf {I}_{(f^*(X) = 1)} - \mathbf {I}_{(f(X) = 1)} \le 0 \).

Therefore, \((2\eta (x) - 1) \big ( \mathbf {I}_{(f^*(X) = 1)} - \mathbf {I}_{(f(X) = 1)}\big ) \ge 0 \).

When we have \(\frac{1}{2} \le \eta (X) \le 1 \):

  1. \(2\eta (x) - 1 \ge 0 \),

  2. applying Eq. (3), \( \mathbf {I}_{(f^*(X) = 1)} - \mathbf {I}_{(f(X) = 1)} \ge 0 \).

Therefore, \((2\eta (x) - 1) \big ( \mathbf {I}_{(f^*(X) = 1)} - \mathbf {I}_{(f(X) = 1)}\big ) \ge 0 \). In other words, the difference between the probability of misclassification of any classifier \(f\) and that of the Bayes classifier \(f^*\) is always nonnegative:

$$\begin{aligned} \mathbf {P}[f(X) \ne Y | X = x] - \mathbf {P}[f^*(X) \ne Y | X = x] \ge 0. \end{aligned}$$

Since this inequality holds for every \(x\), taking the expectation over \(X\) yields Eq. (4). This implies that the Bayes classifier, \(f^*\), is the optimal classifier. \(\square \)
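As a quick numerical illustration of Theorem 1 (not part of the original article), the sketch below simulates data from a known \(\eta(x)\) and compares the empirical error of the Bayes rule with that of an arbitrary competing rule. The logistic form of \(\eta\) and the competing threshold are illustrative assumptions.

```python
# Empirical check of Bayes-classifier optimality on synthetic data.
# eta(x) = P[Y = 1 | X = x] is chosen as a logistic function (an
# illustrative assumption); any competing rule should do no better.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(-3, 3, size=n)
eta = 1.0 / (1.0 + np.exp(-2.0 * x))      # known eta(x)
y = rng.binomial(1, eta)                  # Y | X = x ~ Bernoulli(eta(x))

bayes_pred = (eta >= 0.5).astype(int)     # f*(x) = 1{eta(x) >= 1/2}
other_pred = (x >= 1.0).astype(int)       # an arbitrary competing rule

print("Bayes rule error:     ", np.mean(bayes_pred != y))
print("Competing rule error: ", np.mean(other_pred != y))
# Expected: the Bayes rule's error is no larger than the competitor's,
# consistent with Eq. (4).
```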


About this article


Cite this article

Sambasivan, R., Das, S. Classification and regression using augmented trees. Int J Data Sci Anal 7, 259–276 (2019). https://doi.org/10.1007/s41060-018-0146-6
