Abstract
In this work, we present an algorithm for regression and classification tasks on big datasets using augmented tree models. Partitioning a big dataset with a tree model permits a divide and conquer strategy for classification and regression tasks. Our experiments illustrate an important benefit of this approach: methods that achieve good accuracy on big datasets, such as ensemble tree methods or neural networks, produce models that are not interpretable, whereas the models produced by the proposed algorithm are interpretable while being as accurate as ensemble methods such as random forests or gradient boosted trees. Model interpretation can be performed at coarse and fine granularity, which permits us to extract insights that characterize either the entire dataset or a particular subset of the data. Models that are both accurate and interpretable are highly desirable in many application settings. The partitions created by the algorithm also permit a divide and conquer approach to model analysis; analyzing performance by partition helped identify problems such as possible data errors and model overfitting.
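To make the divide and conquer strategy concrete, the sketch below partitions a dataset with a shallow decision tree and then fits an interpretable linear model within each leaf. This is a minimal illustration of the general idea only, not the authors' implementation: the scikit-learn estimators, the depth-3 tree, and the augmented_tree_predict helper are assumptions made for the example.

```python
# Minimal sketch of a divide-and-conquer "augmented tree" regressor:
# a shallow decision tree partitions the data, and an interpretable
# linear model is fit within each partition (leaf). Illustrative only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=10_000, n_features=10, noise=5.0, random_state=0)

# Step 1: partition the input space with a shallow tree (coarse structure).
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
leaf_ids = tree.apply(X)  # leaf index for each training point

# Step 2: fit an interpretable model per partition (fine structure).
leaf_models = {
    leaf: LinearRegression().fit(X[leaf_ids == leaf], y[leaf_ids == leaf])
    for leaf in np.unique(leaf_ids)
}

def augmented_tree_predict(X_new):
    """Route each point to its leaf, then apply that leaf's linear model."""
    leaves = tree.apply(X_new)
    preds = np.empty(len(X_new))
    for leaf, model in leaf_models.items():
        mask = leaves == leaf
        if mask.any():
            preds[mask] = model.predict(X_new[mask])
    return preds

print(augmented_tree_predict(X[:5]))
```

In this sketch, the fitted tree supplies the coarse, dataset-level view, while the coefficients of each leaf's linear model can be inspected for the fine-grained, per-partition interpretation described above.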
Notes
Feature names are encoded for readability, since using full feature names clutters the visualization. Encoded feature names have the prefix ‘TF’ followed by an integer indicating the column number in the dataset.
References
Bach, F.R., Lanckriet, G.R., Jordan, M.I.: Multiple kernel learning, conic duality, and the SMO algorithm. In: Proceedings of the Twenty-First International Conference on Machine Learning, p. 6. ACM (2004)
Boser, B., Guyon, I., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: 5th Annual ACM Workshop on COLT, pp. 144–152. ACM Press, Pittsburgh (1992)
Bousquet, O., Boucheron, S., Lugosi, G.: Introduction to statistical learning theory. In: Bousquet, O., von Luxburg, U., Rätsch, G. (eds.) Advanced Lectures on Machine Learning, pp. 169–207. Springer (2004)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC Press, Boca Raton (1984)
Briand, B., Ducharme, G.R., Parache, V., Mercat-Rommens, C.: A similarity measure to assess the stability of classification trees. Comput. Stat. Data Anal. 53(4), 1208–1217 (2009)
Brochu, E., Cora, V.M., De Freitas, N.: A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010)
Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M., Elhadad, N.: Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM (2015)
Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM (2016)
Chipman, H., McCulloch, R.E.: Hierarchical priors for Bayesian CART shrinkage. Stat. Comput. 10(1), 17–24 (2000)
Chipman, H.A., George, E.I., McCulloch, R.E.: Bayesian treed models. Mach. Learn. 48(1), 299–320 (2002)
Das, S.: Generalized Linear Models and Beyond: An Innovative Approach from Bayesian Perspective. University of Connecticut, Storrs (2008)
Das, S., Dey, D.K.: On Bayesian analysis of generalized linear models using the Jacobian technique. Am. Stat. 60(3), 264–268 (2006)
Das, S., Dey, D.K.: On Bayesian inference for generalized multivariate gamma distribution. Stat. Probab. Lett. 80(19), 1492–1499 (2010)
Das, S., Dey, D.K.: On dynamic generalized linear models with applications. Methodol. Comput. Appl. Probab. 15, 407–421 (2013)
Das, S., Harel, O., Dey, D.K., Covault, J., Kranzler, H.R.: Analysis of extreme drinking in patients with alcohol dependence using Pareto regression. Stat. Med. 29(11), 1250–1258 (2010)
Das, S., Roy, S., Sambasivan, R.: Fast Gaussian process regression for big data. Big Data Research (2018). https://doi.org/10.1016/j.bdr.2018.06.002. http://www.sciencedirect.com/science/article/pii/S2214579617301909
Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics, vol. 1. Springer, New York (2001)
Gelman, A., Hill, J.: Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, Cambridge (2006)
Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)
Gini, C.: Variability and mutability, contribution to the study of statistical distributions and relations. Studi Economico-Giuridici della R. Università di Cagliari (1912). Reviewed in: Light, R.J., Margolin, B.H.: An analysis of variance for categorical data. J. Am. Stat. Assoc. 66, 534–544 (1971)
Gramacy, R.B., Lee, H.K.H.: Bayesian treed Gaussian process models with an application to computer modeling. J. Am. Stat. Assoc. 103(483), 1119–1130 (2008)
Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58(301), 13–30 (1963). http://www.jstor.org/stable/2282952
Kaggle: The home of data science and machine learning. http://www.kaggle.com/
Kaggle: The playground (2016). http://blog.kaggle.com/2013/09/25/the-playground/
Karalič, A.: Employing linear regression in regression tree leaves. In: Proceedings of the 10th European Conference on Artificial Intelligence, pp. 440–441. Wiley (1992)
Kohavi, R.: Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In: KDD, vol. 96, pp. 202–207 (1996)
Lee, Y., Nelder, J.A.: Hierarchical generalized linear models. J. R. Stat. Soc. Ser. B (Methodol.) 58, 619–678 (1996)
Lichman, M.: UCI machine learning repository (2016). http://archive.ics.uci.edu/ml
Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation forest. In: Eighth IEEE International Conference on Data Mining, 2008. ICDM’08, pp. 413–422. IEEE (2008)
Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer, New York (1996)
Nowak, R.: Introduction to classification and regression (2009). http://nowak.ece.wisc.edu/SLT09/lecture2.pdf
Quinlan, J.R.: C4.5: Programs for Machine Learning. Elsevier, Amsterdam (2014)
Sambasivan, R., Das, S.: Big data regression using tree based segmentation. In: Proceedings of the 14th IEEE India Council International Conference. IEEE (2017)
Shmueli, G.: To explain or to predict? Stat. Sci. 25(3), 289–310 (2010)
Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2. NIPS’12, Lake Tahoe, Nevada, pp. 2951–2959. Curran Associates Inc., USA (2012)
Torgo, L.: Large regression datasets (2016). http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser B (Methodol.) 58, 267–288 (1996)
Torgo, L.: Functional models for regression tree leaves. In: ICML, vol. 97, pp. 385–393 (1997)
Tumer, K., Ghosh, J.: Bayes error rate estimation using classifier ensembles. Int. J. Smart Eng. Syst. Des. 5(2), 95–109 (2003)
USDOT, B.: Rita airline delay data download (2016). http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
Waterhouse, S., MacKay, D., Robinson, T.: Bayesian methods for mixtures of experts. In: Advances in Neural Information Processing Systems, pp. 351–357. MIT Press (1996)
Xiong, H., Pandey, G., Steinbach, M., Kumar, V.: Enhancing data analysis with noise removal. IEEE Trans. Knowl. Data Eng. 18(3), 304–319 (2006)
Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-First International Conference on Machine Learning, p. 116. ACM (2004)
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67(2), 301–320 (2005)
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Appendix
1.1 Optimality of Bayes classifier
Consider a binary classifier (this treatment follows [31] and [32]). A classifier is associated with a decision function \(f(X): X \mapsto \{0,1\}\), where \(X \in \mathbb {R}^d\) denotes the input space and \(Y \in \{0, 1\}\) denotes the output space.
Definition 1
The Bayes classifier \(f^*(X)\) is the following mapping:
$$ f^*(x) = \mathbf {I}_{\left( \eta (x) \ge \frac{1}{2}\right) }, \qquad (3) $$
i.e., \(f^*(x) = 1\) if \(\eta (x) \ge \frac{1}{2}\) and \(f^*(x) = 0\) otherwise. Here \(\eta (x) = \mathbf {P}[Y = 1 \mid X = x]\).
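As a concrete illustration of the definition (at a hypothetical point \(x_0\), not taken from the paper): if \(\eta (x_0) = 0.8\), then \(f^*(x_0) = 1\) and the conditional probability of misclassification is \(\mathbf {P}[f^*(X) \ne Y \mid X = x_0] = 1 - \eta (x_0) = 0.2\).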
Theorem 1
The Bayes classifier, \(f^*(X): X \mapsto \{0,1\}\), is one that minimizes the probability of misclassification, i.e.,
$$ f^* = \mathop {\arg \min }\limits _{f}\, \mathbf {P}[f(X) \ne Y]. \qquad (4) $$
Proof
For any classifier f, the probability of misclassification conditioned on \(X = x\) can be written as
$$ \mathbf {P}[f(X) \ne Y \mid X = x] = 1 - \big ( \mathbf {I}_{(f(x) = 1)}\, \eta (x) + \mathbf {I}_{(f(x) = 0)}\, (1 - \eta (x)) \big ). \qquad (5) $$
Here \(\mathbf {I}_A\) is the indicator function associated with the event A. Along similar lines, we can express the probability of misclassification associated with the Bayes classifier (\(f^*\)) as:
$$ \mathbf {P}[f^*(X) \ne Y \mid X = x] = 1 - \big ( \mathbf {I}_{(f^*(x) = 1)}\, \eta (x) + \mathbf {I}_{(f^*(x) = 0)}\, (1 - \eta (x)) \big ). \qquad (6) $$
Subtracting Eq. (6) from Eq. (5) and using the relations
$$ \mathbf {I}_{(f(x) = 0)} = 1 - \mathbf {I}_{(f(x) = 1)} \quad \text {and} \quad \mathbf {I}_{(f^*(x) = 0)} = 1 - \mathbf {I}_{(f^*(x) = 1)}, $$
we have:
$$ \mathbf {P}[f(X) \ne Y \mid X = x] - \mathbf {P}[f^*(X) \ne Y \mid X = x] = (2\eta (x) - 1)\, \big ( \mathbf {I}_{(f^*(x) = 1)} - \mathbf {I}_{(f(x) = 1)} \big ). \qquad (7) $$
Let us examine the right-hand side of Eq. (7). Since \(\eta (x)\) takes values in [0, 1], there are two cases to consider:
1. \(0 \le \eta (x) < \frac{1}{2}\),
2. \(\frac{1}{2} \le \eta (x) \le 1\).
When \(0 \le \eta (x) < \frac{1}{2}\):
1. \(2\eta (x) - 1 < 0\),
2. applying Eq. (3), \(f^*(x) = 0\), so \(\mathbf {I}_{(f^*(x) = 1)} - \mathbf {I}_{(f(x) = 1)} \le 0\).
Therefore, \((2\eta (x) - 1)\, (\mathbf {I}_{(f^*(x) = 1)} - \mathbf {I}_{(f(x) = 1)}) \ge 0\).
When \(\frac{1}{2} \le \eta (x) \le 1\):
1. \(2\eta (x) - 1 \ge 0\),
2. applying Eq. (3), \(f^*(x) = 1\), so \(\mathbf {I}_{(f^*(x) = 1)} - \mathbf {I}_{(f(x) = 1)} \ge 0\).
Therefore, \((2\eta (x) - 1)\, (\mathbf {I}_{(f^*(x) = 1)} - \mathbf {I}_{(f(x) = 1)}) \ge 0\). In other words, the difference in misclassification probability between any classifier f and the Bayes classifier \(f^*\) is non-negative:
$$ \mathbf {P}[f(X) \ne Y \mid X = x] - \mathbf {P}[f^*(X) \ne Y \mid X = x] \ge 0. $$
This implies that the Bayes classifier, \(f^*\), is the optimal classifier. \(\square \)
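As a sanity check of Eq. (7), continue the hypothetical point used after Definition 1: with \(\eta (x_0) = 0.8\) we have \(f^*(x_0) = 1\), and any classifier f with \(f(x_0) = 0\) has conditional misclassification probability \(\eta (x_0) = 0.8\). The right-hand side of Eq. (7) gives
$$ (2\eta (x_0) - 1)\, \big ( \mathbf {I}_{(f^*(x_0) = 1)} - \mathbf {I}_{(f(x_0) = 1)} \big ) = (0.6)(1 - 0) = 0.6, $$
which matches the direct computation \(0.8 - 0.2 = 0.6 \ge 0\).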
Cite this article
Sambasivan, R., Das, S. Classification and regression using augmented trees. Int J Data Sci Anal 7, 259–276 (2019). https://doi.org/10.1007/s41060-018-0146-6