Abstract
We study sparse high-dimensional additive model fitting via penalization with sparsity-smoothness penalties. We review several existing algorithms that have been developed for this problem in the recent literature, highlighting the connections between them, and present some computationally efficient algorithms for fitting such models. Furthermore, under reasonable assumptions and exploiting recent results on group LASSO-like procedures, we take advantage of several oracle results which yield asymptotic optimality of estimators for high-dimensional but sparse additive models. Finally, the variable selection procedures are compared with some high-dimensional testing procedures available in the literature for testing the presence of additive components.
References
Adams RA (1975) Sobolev spaces. Academic Press, New York
Amato U, Antoniadis A, Pensky M (2006) Wavelet kernel penalized estimation for nonequispaced design regression. Stat Comput 16:37–55
Antoniadis A, Fan J (2001) Regularization of wavelets approximations (with discussion). J Am Stat Assoc 96:939–967
Antoniadis A, Gijbels I, Verhasselt A (2012) Variable selection in additive models using P-splines. Technometrics 54:425–438
Avalos M, Grandvalet Y, Ambroise YC (2007) Parsimonious additive models. Comput Stat Data Anal 51:2851–2870
Bach FR (2008) Bolasso: model consistent lasso estimation through the bootstrap. In: Proceedings of the 25th international conference on machine learning, Helsinki, Finland, pp 33–40
Bazerque JA, Mateos G, Giannakis GB (2011) Group-lasso on splines for spectrum cartography. IEEE Trans Signal Process 59(10):4648–4663
Belloni A, Chernozhukov V (2013) Least squares after model selection in high-dimensional sparse models. Bernoulli 19(2):521–547
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 57:289–300
Benjamini Y, Yekutieli D (2001) The control of false discovery rate under dependency. Ann Stat 29:1165–1188
Bhatti MI, Bracken P (2007) Solutions of differential equations in a Bernstein polynomial basis. J Appl Math Comput 205:272–280
Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of lasso and Dantzig selector. Ann Stat 37(4):1705–1732
Breheny P, Huang J (2009) Penalized methods for bi-level variable selection. Stat Its Interface 2:369–380
Breiman L, Friedman JH (1985) Estimating optimal transformations for multiple regression and correlation. J Am Stat Assoc 80:580–598
Breiman L (1995) Better subset regression using the nonnegative garrote. Technometrics 37:373–384
Bühlmann P (2013) Statistical significance in high-dimensional linear models. Bernoulli 19:1212–1242
Bühlmann P, van de Geer S (2011) Statistics for high-dimensional data: methods, theory and applications. Springer, Heidelberg
Bunea F, Wegkamp M, Auguste A (2006) Consistent variable selection in high dimensional regression via multiple testing. J Stat Plan Inf 136:4349–4364
Bunea F, Tsybakov AB, Wegkamp MH (2007) Sparsity oracle inequalities for the Lasso. Electron J Stat 1:169–194 (electronic)
Cantoni E, Mills Flemming J, Ronchetti E (2011) Variable selection in additive models by Nonnegative Garrote. Stat Model 11:237–252
Casella G, Moreno E (2006) Objective Bayesian variable selection. J Am Stat Assoc 101:157–167
Cui X, Peng H, Wen S, Zhu L (2013) Component selection in the additive regression model. Scand J Stat 40:491–510
de Boor C (2001) A practical guide to splines. Applied Mathematical Sciences, vol 27. Springer, Berlin
Eilers PHC, Marx BD (1996) Flexible smoothing with B-splines and penalties. Stat Sci 11:89–121
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
Fan J, Jiang J (2005) Nonparametric inference for additive models. J Am Stat Assoc 100:890–907
Fan J, Feng Y, Song R (2011) Nonparametric independence screening in sparse ultra-high-dimensional additive models. J Am Stat Assoc 106:544–557
Friedman JH, Stuetzle W (1981) Projection pursuit regression. J Am Stat Assoc 76:817–823
Genuer R, Poggi JM, Tuleau-Malot C (2010) Variable selection using random forests. Pattern Recogn Lett 31(14):2225–2236
Hastie TJ, Tibshirani RJ (1990) Generalized additive models. Chapman & Hall, London
Horowitz J, Klemelä J, Mammen E (2006) Optimal estimation in additive regression models. Bernoulli 12:271–298
Huang J, Breheny P, Ma F (2012) A selective review of group selection in high-dimensional models. Stat Sci 27(4):481–499
Huang J, Horowitz JL, Wei F (2010) Variable selection in nonparametric additive models. Ann Stat 38:2282–2313
Javanmard A, Montanari A (2014) Hypothesis testing in high dimensional regression under the Gaussian random design model: asymptotic theory. IEEE Trans Inf Theory 60(10):6522–6554
Kim J, Kim Y, Kim Y (2008) A gradient-based optimization algorithm for LASSO. J Comput Graph Stat 17:994–1009
Koltchinskii V, Yuan M (2010) Sparsity in multiple kernel learning. Ann Stat 38:3660–3695
Lin Y, Zhang HH (2006) Component selection and smoothing in multivariate nonparametric regression. Ann Stat 34:2272–2297
Liu H, Zhang J (2008) On the \(\ell _1\)-\(\ell _q\) regularized regression, Technical report, Department of Statistics, Carnegie Mellon University. Preprint arXiv:0802.1517
Liu H, Zhang J (2009) Estimation consistency of the group lasso and its applications. J Mach Learn 5:376–383
Mammen E, Park BU (2006) A simple smooth backfitting method for additive models. Ann Stat 34:2252–2271
Marra G, Wood SN (2011) Practical variable selection for generalized additive models. Comput Stat Data Anal 55:2372–2387
Meier L, van de Geer S, Bühlmann P (2009) High-dimensional additive modeling. Ann Stat 37:3779–3821
Ravikumar P, Liu H, Lafferty J, Wasserman L (2009) Spam: sparse additive models. J R Stat Soc Ser B 71:1009–1030
Simon N, Tibshirani R (2012) Standardization and the group lasso penalty. Stat Sin 22:983–1001
Simon N, Friedman J, Hastie T, Tibshirani R (2013) A sparse-group lasso. J Comput Graph Stat 22(2):231–245
Stamey T, Kabalin J, McNeal J, Johnston J, Freiha F, Redwine E, Yang N (1989) Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. J Urol 16:1076–1083
Storlie B, Bondell HD, Reich BJ, Zhang HH (2011) Surface estimation, variable selection, and the nonparametric oracle property. Stat Sin 21:679–705
Suzuki T, Sugiyama M (2013) Fast learning rate of multiple kernel learning: trade-off between sparsity and smoothness. Ann Stat 41:1381–1405
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288
Tutz G, Binder H (2006) Generalized additive modeling with implicit variable selection by likelihood based boosting. Biometrics 62:961–971
van de Geer SA (2008) High-dimensional generalized linear models and the lasso. Ann Stat 36(2):614–645
van de Geer S, Bühlmann P (2009) On the conditions used to prove oracle results for the Lasso. Electron J Stat 3:1360–1392
van de Geer S, Bühlmann P, Ritov Y, Dezeure R (2014) On asymptotically optimal confidence regions and tests for high-dimensional models. Ann Stat 42(3):1166–1202
Verhasselt A (2009) Variable selection with nonnegative garrote in additive models. In: Proceedings of EYSM 2009, Bucharest, Romania, pp 23–25
Wahba G (1990) Spline models for observational data. SIAM, Philadelphia
Wainwright M (2009) Sharp thresholds for high-dimensional and noisy sparsity recovery using \(l_1\)-constrained quadratic programming (Lasso). IEEE Trans Inf Theory 55:2183–2202
Wand MP, Ormerod JT (2008) On semiparametric regression with O’Sullivan penalised splines. Aust N Z J Stat 50:179–198
Wang L, Li H, Huang JZ (2008) Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. J Am Stat Assoc 103:1556–1569
Wiesenfarth M, Krivobokova T, Klasen S, Sperlich S (2012) Direct simultaneous inference in additive models and its application to model undernutrition. J Am Stat Assoc 107(500):1286–1296
Wu S, Xue H, Wu Y, Wu H (2014) Variable selection for sparse high-dimensional nonlinear regression models by combining nonnegative garrote and sure independence screening. Stat Sin 24(3):1365–1387
Wood SN (2006) Generalized additive models: an introduction with R. Chapman & Hall/CRC, Boca Raton
Wood SN (2013) A simple test for random effects in regression models. Biometrika 100(4):1005–1010
Xue L (2009) Consistent variable selection in additive models. Stat Sin 19:1281–1296
Yang L (2008) Confidence band for additive regression model. J Data Sci 6:207–217
Yang Y, Zou H (2015) A fast unified algorithm for computing group lasso penalized learning problems. Stat Comput 25(6):1129–1141
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B 68:49–67
Yuan M (2007) Nonnegative garrote component selection in functional ANOVA models. In: Proceedings of the eleventh international conference on artificial intelligence and statistics, March 21–24, 2007, San Juan, Puerto Rico 2, pp 660–666
Yuan M, Lin Y (2007) On the nonnegative garrote estimator. J R Stat Soc Ser B 69:143–161
Zhang CH, Huang J (2008) The sparsity and bias of the lasso selection in high-dimensional linear regression. Ann Stat 36(4):1567–1594
Zhang CH, Zhang SS (2014) Confidence intervals for low dimensional parameters in high dimensional linear models. J R Stat Soc Ser B 76:217–241
Zhao P, Rocha G, Yu B (2009) The composite absolute penalties family for grouped and hierarchical variable selection. Ann Stat 37:3468–3497
Zheng S (2008) Selection of components and degrees of smoothing via lasso in high dimensional nonparametric additive models. Comput Stat Data Anal 53:164–175
Acknowledgments
We wish to thank three anonymous reviewers, the Associate Editor and the Editor for their helpful comments and for pointing out to us some additional references. In particular, we are extremely grateful to the reviewers for providing valuable and detailed suggestions which have led to substantial improvements in the paper.
Electronic supplementary material
Below is the link to the electronic supplementary material.
10260_2016_357_MOESM1_ESM.pdf
Supplement to the paper “Additive model selection” (additive-supp): detailed results and discussion on all simulation runs for all four examples. (PDF 196 KB)
Appendices
Appendix 1: Some proofs
1.1 Equivalence between (4) and (6)
Sketch of the proof
The Lagrangian corresponding to (6) is
where \(\nu \) and \(\eta =(\eta _1,\ldots ,\eta _p)\) are the corresponding Lagrange multipliers (for the equality and inequality constraints, respectively) and problem (6) becomes
Let us consider the derivative of the Lagrangian with respect to \(\lambda _1,\ldots ,\lambda _p\).
because \(\eta _k=0\) by the complementary slackness condition for the inactive constraints. \(\square \)
Using the equality constraint in (6) we have
It follows that (6) can be rewritten as
1.2 Evaluation of the sparsity-smoothness penalty (7)
Sketch of the proof
First of all observe that
The B-spline of degree m on the knots \(t_1,\ldots ,t_{K-2}\) is
for \(k=-m,\ldots ,K-2\), where the \(\alpha _{r,m+1}^{(k)}\) are the coefficients of the divided differences of degree m at \(t_k\).\(\square \)
Let us consider the matrices \(K^{(m)}=\left( k^{(m)}_{il}\right) \), \(i,l=-m,\ldots ,K-2\) and \(\varOmega ^{(m)}=\left( \omega ^{(m)}_{il}\right) \), \(i,l=-m+j,\ldots ,K-2\) defined by
Let us evaluate \(K^{(m)}\) first. Note that \(K^{(m)}\) is a symmetric band matrix of bandwidth \(2m+1\) and dimension \((K-1+m) \times (K-1+m)\). This implies that it is sufficient to evaluate the principal diagonal and the upper diagonals, i.e. the elements \(k^{(m)}_{i,i+j}\), \(i=-m,\ldots ,K-2\); \(j=0,\ldots ,m\):
\(B_{i,m}(x)\) is nonzero on \((t_i,t_{i+m+1})\) and \(B_{i+j,m}(x)\) is nonzero on \((t_{i+j},t_{i+j+m+1})\), so, setting \(u=\max (0,t_{i+j})\) and \(v=\min (t_{i+m+1},1)\), it follows that
where
But
where \(z=\min (t_{i+r},t_{i+j+s})\) and the integral can be evaluated using Eq. (16).
Now let us evaluate the matrix \(\varOmega ^{(m)}\). First of all let s(x) be the spline of degree m on the knots \(t_1,\ldots ,t_{K-2}\)
Then for each \(j=1,\ldots ,m-1\)
with
This implies that
Using Eq. (17) we have
and \(T^{(j)}\) is obtained iteratively in the following way
where \(V^{(s)}\) is a bi-diagonal matrix of dimension \((m+K-1-s) \times (m+K-s)\) whose principal diagonal is \(d=\left( \ldots ,-(t_{i+m+1-s}-t_i)^{-1},\ldots \right) ^t\), \(i=-m+s,\ldots ,K-2\), and whose upper diagonal is \(d^u=-d\).
Using Eq. (19), Eq. (18) can be rewritten as
Remark 5
Note that the R package fda (http://cran.r-project.org/web/packages/fda) has built-in functions to evaluate both the B-spline functional data basis and the penalty matrices (create.bspline.basis and bsplinepen.R). For the computation of the B-spline basis the algorithm automatically includes \(m+1\) replicates of the end points in addition to the interior knots; the penalty matrices are calculated using the polynomial form of the B-splines.
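As a small illustration of these fda functions, here is a minimal sketch assuming cubic splines (\(m=3\)) and \(K-2=5\) interior knots on \([0,1]\); the resulting matrices correspond to \(K^{(m)}\) and \(\varOmega ^{(m)}\) only up to the boundary-knot convention.

```r
## Minimal sketch using the fda functions mentioned above (assumptions:
## cubic splines, m = 3, and K - 2 = 5 interior knots on [0, 1]).
library(fda)

interior <- seq(0.2, 0.8, length.out = 5)          # interior knots t_1, ..., t_{K-2}
basis <- create.bspline.basis(rangeval = c(0, 1),
                              breaks   = c(0, interior, 1),
                              norder   = 4)         # norder = m + 1 for degree m = 3

Gram  <- bsplinepen(basis, Lfdobj = 0)   # inner products of the basis functions
Omega <- bsplinepen(basis, Lfdobj = 2)   # second-derivative penalty matrix
dim(Gram)                                # (K - 1 + m) x (K - 1 + m), here 9 x 9
```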
We used our own routines (available upon request) to evaluate both the B-spline basis and the penalty matrices. For the B-spline basis we preferred to add \(m+1\) equispaced knots to the left and to the right of the interval defined by the interior knots, following Eilers and Marx (1996); for the penalties we directly implemented the formulas shown above, which give exact evaluations of the integrals.
1.3 Equivalence between Eqs. (8) and (11)
Proof
Eq. (8) is equivalent to Eq. (12) after reparametrization. Now we will show that Eq. (12) is equivalent to solving the following adaptive ridge optimization problem
Let us put \(\varvec{\alpha _j}=\sqrt{\frac{\lambda _j}{\lambda }}\varvec{\gamma _j}\), \(b_j=\sqrt{\frac{\lambda }{\lambda _j}}\), \(\varvec{\alpha }=\left( \varvec{\alpha _1}^t,\ldots ,\varvec{\alpha _p}^t\right) ^t\) and \(\varvec{b}=\left( b_1,\ldots ,b_p\right) ^t\), then Eq. (20) becomes
with \(\mathcal{{L}} \left( \varvec{b,\alpha }\right) =\left\| \mathbf {Y} - \sum _{j=1}^p \hat{H}_j \sqrt{\frac{\lambda }{\lambda _j}}\varvec{\alpha _j} \right\| _n^2\). The Lagrangian associated with Eq. (23) is
where \(\nu \) and \(\varvec{\xi }=(\xi _1,\ldots ,\xi _p)^t\) are the corresponding Lagrange multipliers. The normal equations are therefore:
Now, for every j, one has \(\varvec{\gamma _j}=b_j\varvec{\alpha _j}\), i.e. \(\varvec{\gamma }=\text {diag}(\varvec{b_{\text {new}}})\varvec{\alpha }\), with \(\varvec{b_{\text {new}}}=\left( \varvec{b_1}^t,\ldots ,\varvec{b_p}^t\right) ^t\), and \(\varvec{b_j}=\left( b_j,\ldots ,b_j\right) ^t \in R^{K+2}\). Hence
It follows
This allows us to obtain a relation between \(b_j\) and \(\varvec{\alpha _j}\) independent of \(\mathcal{{L}}, \nu \) and \(\varvec{\xi }\). If \(\varvec{\hat{b}}\) and \(\varvec{\hat{\alpha }}\) are the solutions then
But the Lagrange multipliers for inactive constraints vanish, so \(\text {diag}(\varvec{\hat{b}_{\text {new}}})\varvec{\xi }=0\). Since Eq. (26) is true at the optimum and \(\frac{\partial L}{\partial \varvec{\alpha }}=\frac{\partial L}{\partial \varvec{b}_{\text {new}}}=0\) at \((\varvec{\hat{b}},\varvec{\hat{\alpha }})\), we have
and using the equality constraint in Eq. (23)
Finally, from \(\varvec{\hat{\gamma }_j}=\text {diag}(\varvec{\hat{b}_{\text {new}}}) \varvec{\hat{\alpha }}\), we have
Using the fact that \(\frac{\partial \mathcal{L}}{\partial \varvec{\gamma _j}}\big |_{\varvec{\hat{\gamma }_j}}+2\lambda \frac{\varvec{\hat{\alpha }_j}}{\hat{b}_j}=\varvec{0}\) when \(\hat{b}_j \ne 0\), one easily sees that the solution of Eq. (20) is a solution of the normal equations of
\(\square \)
1.4 Two step algorithm to solve Eq. (11)
Sketch of the proof
For fixed \(\lambda _j\), \(j=1,\ldots ,p\), the solution of problem (11) is given by the classical ridge estimator in the additive regression setting, i.e.
Conversely, problem (11) is equivalent to
with \(\varvec{\tilde{\beta }_j}=\lambda _j\varvec{\beta _j}\) and \(\tilde{M_j}(\alpha )=\lambda _jM_j(\alpha )\).\(\square \)
Now, supposing that \(\varvec{\tilde{\beta }_j}\) and \(\tilde{M_j}(\alpha )\) are known, Eq. (27) can be written more compactly as
i.e.
with \(C=\mathrm {diag}(\varvec{c_1},\ldots ,\varvec{c_p})\), \(\varvec{c_j}={\left( \tilde{M_j}(\alpha )\right) ^{\frac{1}{2}} \varvec{\tilde{\beta }_j}}, j=1,\ldots ,p\); and \(D=\left[ \mathbf {d_1},\ldots ,\mathbf {d_p}\right] \), \(\mathbf {d_j}=B_j\varvec{\tilde{\beta }_j}, j=1\ldots ,p\).
Setting \(\mathbf {z}=\left( {\begin{array}{l} \mathbf {Y}\\ \mathbf {0}\end{array}}\right) \) and \(G=\left( {\begin{array}{l} D\\ C\end{array}} \right) \), Eq. (28) becomes
which is equivalent to solving a nonnegative garrote minimization problem.
Iterating between Eqs. (11) and (29) we therefore obtain a two-step algorithm. Starting from Eq. (11) with \(\tau \) fixed and the parameters \(\lambda _j\) treated as fixed constants, we obtain \(\varvec{\tilde{\beta }_j}\) and \(\tilde{M_j}(\alpha )\) by classical ridge regression in the additive regression setting. Substituting these values into Eq. (29) and minimizing with respect to the vector of the \(\lambda _j\)’s, the new values of the \(\lambda _j\)’s are obtained by the nonnegative garrote, but using the previous ridge estimates of \(\varvec{\beta }_j\) instead of the ordinary least squares estimates.
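For illustration only, here is a minimal R sketch of this two-step iteration. It is not the authors' code: the ridge step uses a plain block-diagonal penalty \(\alpha \lambda _j\varOmega _j\), the garrote step keeps only the fit block D of G in Eq. (28) (the smoothness block C would be stacked below D in the same way), and the constraints \(\lambda _j\ge 0\), \(\sum _j\lambda _j\le s\) are handled with quadprog::solve.QP; the names two_step, B and Omega are ours.

```r
## Hedged sketch of the ridge / nonnegative-garrote iteration (not the
## authors' implementation).  B[[j]]: n x K B-spline design matrix for X^j;
## Omega[[j]]: K x K smoothness penalty matrix; alpha, s: tuning parameters.
library(quadprog)

two_step <- function(Y, B, Omega, alpha, s, n_iter = 10) {
  p <- length(B); K <- ncol(B[[1]])
  lambda <- rep(1, p)                               # initial lambda_j's
  Ball <- do.call(cbind, B)
  for (it in seq_len(n_iter)) {
    ## Step 1: ridge step for the beta_j's with the lambda_j's held fixed
    Pen <- matrix(0, p * K, p * K)
    for (j in seq_len(p)) {
      idx <- (j - 1) * K + seq_len(K)
      Pen[idx, idx] <- lambda[j] * Omega[[j]]
    }
    beta <- solve(crossprod(Ball) + alpha * Pen + 1e-8 * diag(p * K),
                  crossprod(Ball, Y))               # small jitter for stability
    ## Step 2: garrote step for the lambda_j's with beta fixed:
    ## minimize ||Y - D lambda||^2 s.t. lambda_j >= 0, sum(lambda_j) <= s
    D <- sapply(seq_len(p), function(j) B[[j]] %*% beta[(j - 1) * K + seq_len(K)])
    Dmat <- crossprod(D) + 1e-8 * diag(p)           # jitter for positive definiteness
    dvec <- drop(crossprod(D, Y))
    Amat <- cbind(diag(p), -rep(1, p))              # lambda >= 0 and -sum(lambda) >= -s
    bvec <- c(rep(0, p), -s)
    lambda <- solve.QP(Dmat, dvec, Amat, bvec)$solution
  }
  list(lambda = lambda, beta = beta)
}
```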
Appendix 2: Algorithms and methods for selecting tuning parameters
To implement the proposed method, we need to choose the tuning parameters K, \(\lambda \) and \(\alpha \), as well as s in the nonnegative garrote. The parameters \(\alpha \) and K control the smoothness of the additive components, while \(\lambda \) and s determine the sparsity. In Sect. 3 we have shown at which rates these tuning parameters must grow or decay in order to obtain variable selection and consistent estimation. In practice, however, we need data-driven procedures for selecting these regularization parameters.
The parameter K was set to 7, based on the dimensionality of the examples and after some experimentation (results were robust with respect to values around 7). This choice follows the rule \((K-2)\asymp {n}^{1/5}\), with the interior knots placed at the empirical quantiles of \(X^{j}\), \(j=1,\ldots ,p\), completed by two boundary knots at the lower and upper bounds of the interval on which the additive components are defined.
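As a small illustration of this knot rule, here is a sketch of our own (the sample size and covariate are simulated, not from the paper):

```r
## Illustration of the knot rule: K - 2 interior knots at empirical
## quantiles of X^j, plus the two boundary knots (our own sketch).
n <- 200
x <- runif(n)                                        # one covariate X^j
K <- 7                                               # as in the paper's examples
probs <- seq(0, 1, length.out = K)[2:(K - 1)]        # K - 2 interior quantile levels
knots <- c(min(x), quantile(x, probs, names = FALSE), max(x))
knots                                                # boundary + 5 interior knots
```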
1.1 Choice of the optimal \(\alpha \)
The choice of the optimal smoothing parameter \(\alpha \) in all the proposed methods is geared towards variable selection in the context of additive modeling. We have implemented the following heuristic algorithm:
1.2 Choice of the regularization parameters \(\lambda \) and s
gam estimates the regularization parameter \(\lambda \) by GCV as
(see Avalos et al. 2007 for details). In all the other packages (grpreg, gglasso, sgl, cosso, hgam) the regularization parameter \(\lambda \) has been selected by 5-fold cross-validation. The same choice has been adopted for the testing procedures test-glr and test-nng to determine the parameter s controlling sparsity in the nonnegative garrote.
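As a hedged illustration of the cross-validation step, the sketch below uses cv.grpreg from the grpreg package on a B-spline-expanded design; the simulated data and the df = K expansion are ours and are not taken from the paper's scripts.

```r
## Hedged sketch: 5-fold CV choice of lambda for a group-lasso fit on a
## B-spline-expanded design (illustration only).
library(splines)
library(grpreg)

set.seed(1)
n <- 100; p <- 8; K <- 7
X <- matrix(runif(n * p), n, p)
y <- sin(2 * pi * X[, 1]) + (2 * X[, 2] - 1)^2 + rnorm(n, sd = 0.3)

## expand each covariate into a B-spline basis; one group per covariate
Bmat  <- do.call(cbind, lapply(seq_len(p), function(j) bs(X[, j], df = K)))
group <- rep(seq_len(p), each = K)

cvfit <- cv.grpreg(Bmat, y, group = group, penalty = "grLasso", nfolds = 5)
cvfit$lambda.min                       # lambda selected by 5-fold CV
```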