Additive model selection

Amato, Umberto; Antoniadis, Anestis; De Feis, Italia

doi:10.1007/s10260-016-0357-8

Additive model selection

Original Paper
Published: 12 March 2016

Volume 25, pages 519–564, (2016)
Cite this article

Statistical Methods & Applications Aims and scope Submit manuscript

Umberto Amato¹,
Anestis Antoniadis^2,3 &
Italia De Feis¹

718 Accesses
7 Citations
Explore all metrics

Abstract

We study sparse high dimensional additive model fitting via penalization with sparsity-smoothness penalties. We review several existing algorithms that have been developed for this problem in the recent literature, highlighting the connections between them, and present some computationally efficient algorithms for fitting such models. Furthermore, using reasonable assumptions and exploiting recent results on group LASSO-like procedures, we take advantage of several oracle results which yield asymptotic optimality of estimators for high-dimensional but sparse additive models. Finally, variable selection procedures are compared with some high-dimensional testing procedures available in the literature for testing the presence of additive components.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Entropy-Based Subsampling Methods for Big Data

Article 11 April 2024

A Guide for Sparse PCA: Model Comparison and Applications

Article Open access 29 June 2021

Residuals-based distributionally robust optimization with covariate information

Article 26 September 2023

References

Adams RA (1975) Sobolev spaces. Academic Press, New York
MATH Google Scholar
Amato U, Antoniadis A, Pensky M (2006) Wavelet kernel penalized estimation for nonequispaced design regression. Stat Comput 16:37–55
Article MathSciNet Google Scholar
Antoniadis A, Fan J (2001) Regularization of wavelets approximations (with discussion). J Am Stat Assoc 96:939–967
Article MathSciNet MATH Google Scholar
Antoniadis A, Gijbels I, Verhasselt A (2012) Variable selection in additive models using P-splines. Technometrics 54:425–438
Article MathSciNet Google Scholar
Avalos M, Grandvalet Y, Ambroise YC (2007) Parsimonious additive models. Comput Stat Data Anal 51:2851–2870
Article MathSciNet MATH Google Scholar
Bach FR (2008) Bolasso: model consistent lasso estimation through the bootstrap. In: Proceedings of the 25th international conference on machine learning, Helsinki, Finland, pp 33–40
Bazerque JA, Mateos G, Giannakis GB (2011) Group-lasso on splines for spectrum cartography. IEEE Trans Signal Process 59(10):4648–4663
Article MathSciNet Google Scholar
Belloni A, Chernozhukov V (2013) Least squares after model selection in high-dimensional sparse models. Bernoulli 19(2):521–547
Article MathSciNet MATH Google Scholar
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical powerful approach to multiple hypothesis testing. J R Stat Soc B 57:289–300
MathSciNet MATH Google Scholar
Benjamini Y, Yekutieli D (2001) The control of false discovery rate under dependency. Ann Stat 29:1165–1188
Article MathSciNet MATH Google Scholar
Bhatti MI, Bracken P (2007) Solutions of differential equations in a Bernstein polynomial Basis. J Appl Math Comput 205:272–280
Article MathSciNet MATH Google Scholar
Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of lasso and Dantzig selector. Ann Stat 37(4):1705–1732
Article MathSciNet MATH Google Scholar
Breheny P, Huang J (2009) Penalized methods for bi-level variable selection. Stat Its Interface 2:369–380
Article MathSciNet MATH Google Scholar
Breiman L, Friedman JH (1985) Estimating optimal transformations for multiple regression and correlation. J Am Stat Assoc 80:580–598
Article MathSciNet MATH Google Scholar
Breiman L (1995) Better subset regression using the nonnegative garrote. Technometrics 37:373–384
Article MathSciNet MATH Google Scholar
Bühlmann P (2013) Statistical significance in high-dimensional linear models. Bernoulli 19:1212–1242
Article MathSciNet MATH Google Scholar
Bühlmann P, van de Geer S (2011) Statistics for high-dimensional data: methods, theory and applications. Springer, Heidelberg
Book MATH Google Scholar
Bunea F, Wegkamp M, Auguste A (2006) Consistent variable selection in high dimensional regression via multiple testing. J Stat Plan Inf 136:4349–4364
Article MathSciNet MATH Google Scholar
Bunea F, Tsybakov AB, Wegkamp MH (2007) Sparsity oracle inequalities for the Lasso. Electron J Stat 1:169–194 (electronic)
Article MathSciNet MATH Google Scholar
Cantoni E, Mills Flemming J, Ronchetti E (2011) Variable selection in additive models by Nonnegative Garrote. Stat Model 11:237–252
Article MathSciNet Google Scholar
Casella G, Moreno E (2006) Objective Bayesian variable selection. J Am Stat Assoc 101:157–167
Article MathSciNet MATH Google Scholar
Cui X, Peng H, Wen S, Zhu L (2013) Component selection in the additive regression model. Scand J Stat 40:491–510
Article MathSciNet MATH Google Scholar
de Boor C (2001) A practical guide to splines. Springer, Applied Mathematical Sciences, vol 27, Berlin (2001)
Eilers PHC, Marx BD (1996) Flexible smoothing with B-splines and penalties. Stat Sci 11:89–121
Article MathSciNet MATH Google Scholar
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
Article MathSciNet MATH Google Scholar
Fan J, Jiang J (2005) Nonparametric inference for additive models. J Am Stat Assoc 100:890–907
Article MathSciNet MATH Google Scholar
Fan J, Feng Y, Song R (2011) Nonparametric independence screening in sparse ultra-high-dimensional additive models. J Am Stat Assoc 106:544–557
Article MathSciNet MATH Google Scholar
Friedman J, Stuelze W (1981) Projection pursuit regression. J Am Stat Assoc 76:817–823
Article MathSciNet Google Scholar
Genuer R, Poggi JM, Tuleau-Malot C (2010) Variable selection using random forests. Pattern Recogn Lett 31(14):2225–2236
Article Google Scholar
Hastie TJ, Tibshirani RJ (1990) Generalized additive models. Chapman & Hall, London
MATH Google Scholar
Horowitz J, Klemel J, Mammen E (2006) Optimal estimation in additive regression models. Bernoulli 12:271–298
Article MathSciNet MATH Google Scholar
Huang J, Breheny P, Ma F (2012) A selective review of group selection in high dimensional model. Stat Sci 27(4):481–499
Article MathSciNet MATH Google Scholar
Huang J, Horowitz JL, Wei F (2010) Variable selection in nonparametric additive models. Ann Stat 38:2282–2313
Article MathSciNet MATH Google Scholar
Javanmard A, Montanari A (2014) Hypothesis testing in high dimensional regression under the Gaussian random design model: asymptotic theory. IEEE Trans Inf Theory 60(10):6522–6554
Article MathSciNet Google Scholar
Kim J, Kim Y, Kim Y (2008) A gradient-based optimization algorithm for LASSO. J Comput Graph Stat 17:994–1009
Article MathSciNet MATH Google Scholar
Koltchinskii V, Yuan M (2010) Sparsity in multiple kernel learning. Ann Stat 38:3660–3695
Article MathSciNet MATH Google Scholar
Lin Y, Zhang HH (2006) Component selection and smoothing in multivariate nonparametric regression. Ann Stat 34:2272–2297
Article MathSciNet MATH Google Scholar
Liu H, Zhang J (2008) On the $\ell _1$-$\ell _q$ regularized regression, Technical report, Department of Statistics, Carnegie Mellon University. Preprint arXiv:0802.1517
Liu H, Zhang J (2009) Estimation consistency of the group lasso and its applications. J Mach Learn 5:376–383
Google Scholar
Mammen E, Park BU (2006) A simple smooth backfitting method for additive models. Ann Stat 34:2252–2271
Article MathSciNet MATH Google Scholar
Marra G, Wood SN (2011) Practical variable selection for generalized additive models. Comput Stat Data Anal 55:2372–2387
Article MathSciNet MATH Google Scholar
Meier L, van de Geer S, Bühlmann P (2009) High-dimensional additive modeling. Ann Stat 37:3779–3821
Article MathSciNet MATH Google Scholar
Ravikumar P, Liu H, Lafferty J, Wasserman L (2009) Spam: sparse additive models. J R Stat Soc Ser B 71:1009–1030
Article MathSciNet Google Scholar
Simon N, Tibshirani R (2012) Standardization and the group lasso penalty. Stat Sin 22:983–1001
MathSciNet MATH Google Scholar
Simon N, Friedman J, Hastie T, Tibshirani R (2013) A sparse-group lasso. J Comput Graph Stat 22(2):231–245
Article MathSciNet Google Scholar
Stamey T, Kabalin J, McNeal J, Johnston J, Freiha F, Redwine E, Yang N (1989) ostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. J Urol 16:1076–1083
Google Scholar
Storlie B, Bondell HD, Reich BJ, Zhang HH (2011) Surface estimation, variable selection, and the nonparametric oracle property. Stat Sin 21:679–705
Article MathSciNet MATH Google Scholar
Suzuki T, Sugiyama M (2013) Fast learning rate of multiple kernel learning: trade-off between sparsity and smoothness. Ann Stat 41:1381–1405
Article MathSciNet MATH Google Scholar
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288
MathSciNet MATH Google Scholar
Tutz G, Binder H (2006) Generalized additive modeling with implicit variable selection by likelihood based boosting. Biometrics 62:961–971
Article MathSciNet MATH Google Scholar
van de Geer SA (2008) High-dimensional generalized linear models and the lasso. Ann Stat 36(2):614–645
Article MathSciNet MATH Google Scholar
van de Geer S, Bühlmann P (2009) On the conditions used to prove oracle results for the Lasso. Electron J Stat 3:1360–1392
Article MathSciNet MATH Google Scholar
van de Geer S, Bühlmann P, Ritov Y, Dezeure R (2014) On asymptotically optimal confidence regions and tests for high-dimensional models. Ann Stat 42(3):1166–1202
Article MathSciNet MATH Google Scholar
Verhasselt A (2009) Variable selection with nonnegative garrote in additive models. In: Proceedings of EYSM 2009, Bucharest, Romania, pp 23–25
Wahba G (1990) Spline models for observational data. SIAM, Philadelphia
Book MATH Google Scholar
Wainwright M (2009) Sharp thresholds for high-dimensional and noisy sparsity recovery using $l_1$-constrained quadratic programming (Lasso). IEEE Trans Inf Theory 55:2183–2202
Article MathSciNet Google Scholar
Wand MP, Ormerod JT (2008) On semiparametric regression with O’Sullivan penalised splines. Aust N Z J Stat 50:179–198
Article MathSciNet MATH Google Scholar
Wang L, Li H, Huang JZ (2008) Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. J Am Stat Assoc 103:1556–1569
Article MathSciNet MATH Google Scholar
Wiesenfarth M, Krivobokova T, Klasen S, Sperlich S (2012) Direct simultaneous inference in additive models and its application to model undernutrition. J Am Stat Assoc 107(500):1286–1296
Article MathSciNet MATH Google Scholar
Wu S, Xue H, Wu Y, Wu H (2014) Variable selection for sparse high-dimensional nonlinear regression models by combining nonnegative garrote and sure independence screening. Stat Sin 24(3):1365–1387
MathSciNet MATH Google Scholar
Wood, S. N., Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC (2006)
Wood SN (2013) A simple test for random effects in regression models. Biometrika 100(4):1005–1010
Article MathSciNet MATH Google Scholar
Xue L (2009) Consistent variable selection in additive models. Stat Sin 19:1281–1296
MathSciNet MATH Google Scholar
Yang L (2008) Confidence band for additive regression model. J Data Sci 6:207–217
Google Scholar
Yang Y, Zou H (2015) A fast unified algorithm for computing group lasso penalized learning problems. Stat Comput 25(6):1129–1141
Article MathSciNet MATH Google Scholar
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B 68:49–67
Article MathSciNet MATH Google Scholar
Yuan M (2007) Nonnegative garrote component selection in functional ANOVA models. In: Proceedings of the eleventh international conference on artificial intelligence and statistics, March 21–24, 2007, San Juan, Puerto Rico 2, pp 660–666
Yuan M, Lin Y (2007) On the nonnegative garrote estimator. J R Stat Soc Ser B 69:143–161
Article MathSciNet Google Scholar
Zhang CH, Huang J (2008) The sparsity and bias of the lasso selection in high-dimensional linear regression. Ann Stat 36(4):1567–1594
Article MathSciNet MATH Google Scholar
Zhang CH, Zhang SS (2014) Confidence intervals for low dimensional parameters in high dimensional linear models. J R Stat Soc Ser B 76:217–241
Article MathSciNet Google Scholar
Zhao P, Rocha G, Yu B (2009) The composite absolute penalties family for grouped and hierarchical variable selection. Ann Stat 37:3468–3497
Article MathSciNet MATH Google Scholar
Zheng S (2008) Selection of components and degrees of smoothing via lasso in high dimensional nonparametric additive models. Comput Stat Data Anal 53:164–175
Article MathSciNet MATH Google Scholar

Download references

Acknowledgments

We wish to thank three anonymous reviewers, the Associate Editor and the Editor for their helpful comments and for pointing out to us some additional references. In particular, we are extremely grateful to the reviewers for providing valuable and detailed suggestions which have led to substantial improvements in the paper.

Author information

Authors and Affiliations

Istituto per le Applicazioni del Calcolo ‘M. Picone’, National Research Council, Via Pietro Castellino 111, 80131, Naples, Italy
Umberto Amato & Italia De Feis
Laboratoire Jean Kuntzmann, Université Joseph Fourier, BP 53, 38041, Grenoble Cedex 09, France
Anestis Antoniadis
Department of Statistical Sciences, University of Cape Town, Rondebosch, Cape Town, 7701, South Africa
Anestis Antoniadis

Authors

Umberto Amato
View author publications
You can also search for this author in PubMed Google Scholar
Anestis Antoniadis
View author publications
You can also search for this author in PubMed Google Scholar
Italia De Feis
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Italia De Feis.

Electronic supplementary material

Below is the link to the electronic supplementary material.

10260_2016_357_MOESM1_ESM.pdf

Supplement to the paper “additive model selection” (additive-supp). Detailed results and discussion on all simulation runs for all four examples.(PDF 196KB)

Appendices

Appendix 1: Some proofs

1.1 Equivalence between (4) and (6)

Sketch of the proof

The Lagrangian corresponding to (6) is

$$\begin{aligned} L(f_1,\ldots ,f_p,\lambda _1,\ldots ,\lambda _p)= & {} \left\| \mathbf {Y} - \sum _{j=1}^p \mathbf {f}_j\right\| _n^2 + \sum _{j=1}^p \lambda _j J^2_\alpha (f_j)+ \nu \left( \sum _{j=1}^p \frac{1}{\lambda _j}-\frac{p}{\lambda }\right) \\&+\sum _{j=1}^p \eta _j \lambda _j, \end{aligned}$$

where $\nu $ and $\eta =(\eta _1,\ldots ,\eta _p)$ are the corresponding Lagrange multiplier (equality and inequality constraints) and problem (6) becomes

$$\begin{aligned} \min _{f_j\in {\mathcal {H}}_j,\lambda _j} L(f_1,\ldots ,f_p,\lambda _1,\ldots ,\lambda _p). \end{aligned}$$

Let us consider the derivative of the Lagrangian with respect to $\lambda _1,\ldots ,\lambda _p$.

$$\begin{aligned} \frac{\partial {L(f_1,\ldots ,f_p,\lambda _1,\ldots ,\lambda _p)}}{\partial {\lambda _k}}=0 \Leftrightarrow \lambda _k=\frac{\sqrt{\nu }}{J_{\alpha }(f_k)}, \end{aligned}$$

because $\eta _k=0$ for the complementarity condition for the inactive constraints. $\square $

Using the equality constraint in (6) we have

$$\begin{aligned} \sqrt{\nu }= & {} \frac{\lambda }{p}\sum _{j=1}^pJ_{\alpha }(f_j) \Rightarrow \lambda _k=\frac{1}{J_{\alpha }(f_k)}\frac{\lambda }{p}\sum _{j=1}^p J_{\alpha }(f_j) \Rightarrow \sum _{j=1}^p \lambda _j J^2_{\alpha }(f_j)\\= & {} \frac{\lambda }{p}\left( \sum _{j=1}^pJ_{\alpha }(f_j)\right) ^2. \end{aligned}$$

It follows that (6) can be rewritten as

$$\begin{aligned} \min _{{f_j\in {\mathcal {H}}_j} } \left\{ \left\| \mathbf {Y} - \sum _{j=1}^p \mathbf {f}_j\right\| _n^2 + \frac{\lambda }{p} \left( \sum _{j=1}^p J_\alpha (f_j)\right) ^2\right\} . \end{aligned}$$

1.2 Evaluation of the sparsity-smoothness penalty (7)

Sketch of the proof

First of all observe that

$$\begin{aligned}&\int _u^v(a-x)^m(b-x)^m dx\nonumber \\&=\sum _{k=0}^m\sum _{j=0}^m \left( \begin{array}{c} m \\ k \end{array}\right) \left( \begin{array}{c} m \\ j \end{array}\right) a^kb^j(-1)^{k+j}\frac{(v^{2m-k-j+1}-u^{2m-j-k+1})}{(2m-k-j+1)}. \end{aligned}$$

(16)

The B-spline of degree m on the knots $t_1,\ldots ,t_{K-2}$ is

$$\begin{aligned} B_{k,m}(x)=(t_{k+m+1}-t_k)\left[ t_l,\ldots ,t_{k+m+1}\right] (t-x)_+^m=\sum _{r=0}^{m+1}\alpha _{r,m+1}^{(k)}(t_{k+r}-x)_+^m \end{aligned}$$

for $k=-m,\ldots ,K-2$ and $\alpha _{r,m+1}^{(k)}$ coefficients of the divided differences of degree m in $t_k$.$\square $

Let us consider the matrices $K^{(m)}=\left( k^{(m)}_{il}\right) $, $i,l=-m,\ldots ,K-2$ and $\varOmega ^{(m)}=\left( \omega ^{(m)}_{il}\right) $, $i,l=-m+j,\ldots ,K-2$ defined by

$$\begin{aligned} k^{(m)}_{il}=\int _0^1 B_{i,m}(x)B_{l,m}(x)dx \\ \omega ^{(m)}_{il}=\int _0^1 B_{i,m}^{(j)}(x)B_{l,m}^{(j)}(x)dx. \end{aligned}$$

Let us evaluate $K^{(m)}$ first. Note that $K^{(m)}$ is a $2m+1$ symmetric band matrix of dimension $K-1+m \times K-1+m$. This implies that it is sufficient to evaluate the principal diagonal and the upper diagonals, i.e. the elements $k^{(m)}_{ii+j}$, $i=-m,\ldots ,K-2; j=0,\ldots ,m$:

$$\begin{aligned} k^{(m)}_{ii+j}=\int _0^1 B_{i,m}(x)B_{i+j,m}(x)dx. \end{aligned}$$

$B_{i,m}(x)$ is different from zero in $(t_i,t_{i+m+1})$ and $B_{i+j,m}(x)$ is different from zero in $(t_{i+j},t_{i+j+m+1})$, so, assuming $u=\max (0,t_{i+j})$ and $v=\min (t_{i+m+1},1)$, it follows

$$\begin{aligned} k^{(m)}_{ii+j}= & {} \int _u^v B_{i,m}(x)B_{i+j,m}(x)dx=\sum _{r=0}^{m+1}\sum _{s=0}^{m+1}\alpha _{r,m+1}^{(i)}\alpha _{s,m+1}^{(i+j)}I(u,v,t_{i+r},t_{i+j+s}) \end{aligned}$$

where

$$\begin{aligned} I(u,v,t_{i+r},t_{i+j+s})=\int _u^v (t_{i+r}-x)_{+}^{m}(t_{i+j+s}-x)_{+}^{m}dx. \end{aligned}$$

But

$$\begin{aligned} I(u,v,t_{i+r},t_{i+j+s})=\left\{ \begin{array}{ll} 0 &{}\quad \text {if} \ \min (t_{i+r},t_{i+j+s})\le u \\ \int _u^{\min (v,z)}(t_{i+r}-x)(t_{i+j+s}-x)dx &{} \quad \text {otherwise} \end{array} \right. \end{aligned}$$

where $z=\min (t_{i+r},t_{i+j+s})$ and the integral can be evaluated using Eq. (16).

Now let us evaluate the matrix $\varOmega ^{(m)}$. First of all let s(x) be the spline of degree m on the knots $t_1,\ldots ,t_{K-2}$

$$\begin{aligned} s(x)=\sum _{i=-m}^{K-2}\beta _iB_{i,m}(x). \end{aligned}$$

Then for each $j=1,\ldots ,m-1$

$$\begin{aligned} s^{(j)}(x)=\sum _{i=-m+j}^{K-2}\beta _i^{(j)}B_{i,m-j}(x) \end{aligned}$$

with

$$\begin{aligned} \beta _i^{(j)}=\left\{ \begin{array}{ll} \beta _i &{} \quad \mathrm{if} \quad j=0 \\ (m+1-j)\frac{\beta _i^{(j-1)}-\beta _{i-1}^{(j-1)}}{t_{i+m+1-j}-t_i} &{}\quad \mathrm{if} \quad j>0.\end{array} \right. \end{aligned}$$

(17)

This implies that

$$\begin{aligned} \int _0^1\left( s^{(j)}(x)\right) ^2dx= & {} \sum _{i=-m+j}^{K-2}\sum _{l=-m+j}^{K-2}\beta _i^{(j)}\beta _l^{(j)}\int _0^1 B_{i,m-j}(x)B_{l,m-j}(x)dx\nonumber \\= & {} \left( \beta ^{(j)}\right) ^tK^{(m-j)}\beta ^{(j)}\nonumber \\ \beta ^{(j)}= & {} \left( \beta _{-m+j}^{(j)},\ldots ,\beta _{K-2}^{(j)}\right) ^t. \end{aligned}$$

(18)

Using Eq. (17) we have

$$\begin{aligned} \beta ^{(j)}=T^{(j)}\beta , \end{aligned}$$

(19)

and $T^{(j)}$ is obtained iteratively in the following way

$$\begin{aligned} T^{(0)}= & {} I_{m+K-1 \times m+K-1}\\ T^{(s)}= & {} (m+1-s)V^{(s)}T^{(s-1)}, \quad s=1,\ldots ,j; \end{aligned}$$

where $V^{(s)}$ is a bi-diagonal matrix of dimension $m+K-1-s \times m+K-s$ whose principal diagonal $d=\left( \ldots ,-(t_{i+m+1-s}-t_i)^{-1},\ldots \right) ^t, i=-m+s,\ldots ,K-2$ and upper diagonal $d^u=-d$.

Eq. (18) can be rewritten using Eq. (19)

$$\begin{aligned} \int _0^1\left( s^{(j)}(x)\right) ^2dx= & {} \beta ^t\left( T^{(j)}\right) ^tK^{(m-j)}T^{(j)}\beta =\beta ^t\varOmega ^{(m)}\beta \\ \varOmega ^{(m)}= & {} \left( T^{(j)}\right) ^tK^{(m-j)}T^{(j)}. \end{aligned}$$

Remark 5

Note that the R package fda (http://cran.r-project.org/web/packages/fda) has built-in functions to evaluate both the b-splines functional data basis and the penalties matrices (create.bspline.basis and bsplinepen.R). For the computation of the b-spline basis the algorithm automatically includes $m+1$ replicates of the end points to the interior knots; the penalties matrices are calculated using the polynomial form of the b-splines.

We used our routines to evaluate both the b-spline basis and the penalties matrices (available upon request). For the b-splines basis we preferred to add $m+1$ equispaced knots to the left and to the right of the interval defined by the interior knots following Eilers and Marx (1996); for the penalties we implemented directly the formulas shown above, that explicitly give exact calculations to evaluate the integrals.

1.3 Equivalence between Eqs. (8) and (11)

Proof

Eq. (8) is equivalent to Eq. (12) after reparametrization. Now we will show that Eq. (12) is equivalent to solve the following adaptive ridge optimization problem

$$\begin{aligned} \min _{\varvec{\gamma _1}, \ldots , \varvec{\gamma _p}}&\left\| \mathbf {Y} - \sum _{j=1}^p \hat{H}_j \varvec{\gamma _j} \right\| _n^2 + \sum _{j=1}^p \lambda _j \left\| \varvec{\gamma _j} \right\| ^2 \end{aligned}$$

(20)

$$\begin{aligned} \text {subject to:}&\sum _{j=1}^p \frac{1}{\lambda _j} = \frac{p}{\lambda }, \end{aligned}$$

(21)

$$\begin{aligned}&\lambda _j > 0, \quad j=1, \ldots , p. \end{aligned}$$

(22)

Let us put $\varvec{\alpha _j}=\sqrt{\frac{\lambda _j}{\lambda }}\varvec{\gamma _j}$, $b_j=\sqrt{\frac{\lambda }{\lambda _j}}$, $\varvec{\alpha }=\left( \varvec{\alpha _1}^t,\ldots ,\varvec{\alpha _p}^t\right) ^t$ and $\varvec{b}=\left( b_1,\ldots ,b_p\right) ^t$, then Eq. (20) becomes

$$\begin{aligned} \min _{\varvec{b},\varvec{\alpha }}&\mathcal{{L}} \left( \varvec{b,\alpha }\right) + \lambda \sum _{j=1}^p \left\| \varvec{\alpha _j} \right\| ^2 \end{aligned}$$

(23)

$$\begin{aligned} \text {subject to:}&\sum _{j=1}^p b_j^2 = p, \end{aligned}$$

(24)

$$\begin{aligned}&b_j \ge 0, \quad j=1, \ldots , p, \end{aligned}$$

(25)

with $\mathcal{{L}} \left( \varvec{b,\alpha }\right) =\left\| \mathbf {Y} - \sum _{j=1}^p \hat{H}_j \sqrt{\frac{\lambda }{\lambda _j}}\varvec{\alpha _j} \right\| _n^2$. The Lagrangian associated to (Eq. 23) is

$$\begin{aligned} L\left( \varvec{b,\alpha }\right) =\mathcal{{L}} \left( \varvec{b,\alpha }\right) + \lambda \sum _{j=1}^p \left\| \varvec{\alpha _j} \right\| ^2+\nu \left( \sum _{j=1}^p b_j^2-p\right) -\varvec{\xi ^t b} \end{aligned}$$

where $\nu $ and $\varvec{\xi }=(\xi _1,\ldots ,\xi _p)^t$ are the corresponding Lagrange multipliers. The normal equations are therefore:

$$\begin{aligned} \frac{\partial L\left( \varvec{b,\alpha }\right) }{\partial \varvec{\alpha _j}}= & {} \frac{\partial \mathcal{{L}}\left( \varvec{b,\alpha }\right) }{\partial \varvec{\alpha _j}}+2\lambda \varvec{\alpha _j}, \quad j=1,\ldots ,p \\ \frac{\partial L\left( \varvec{b,\alpha }\right) }{\partial \varvec{b}}= & {} \frac{\partial \mathcal{{L}}\left( \varvec{b,\alpha }\right) }{\partial \varvec{b}}+2\nu \varvec{b}-\varvec{\xi }. \end{aligned}$$

Now, for every j, one has $\varvec{\gamma _j}=b_j\varvec{\alpha _j}$, i.e. $\varvec{\gamma }=\text {diag}(\varvec{b_{\text {new}}})\varvec{\alpha }$, with $\varvec{b_{\text {new}}}=\left( \varvec{b_1}^t,\ldots ,\varvec{b_p}^t\right) ^t$, and $\varvec{b_j}=\left( b_j,\ldots ,b_j\right) ^t \in R^{K+2}$. Hence

$$\begin{aligned} \frac{\partial \mathcal{{L}}}{\partial \varvec{\alpha }}= & {} \text {diag}(\varvec{b_{\text {new}}})\frac{\partial \mathcal{{L}}}{\partial \varvec{\gamma }}\\ \frac{\partial \mathcal{{L}}}{\partial \varvec{b_{\text {new}}}}= & {} \text {diag}(\varvec{ \alpha })\frac{\partial \mathcal{{L}}}{\partial \varvec{\gamma }}. \end{aligned}$$

It follows

$$\begin{aligned} \text {diag}(\varvec{\alpha })\frac{\partial \mathcal{{L}}}{\partial \varvec{\alpha }}=\text {diag}(\varvec{b_{\text {new}}})\frac{\partial \mathcal{{L}}}{\partial \varvec{b_{\text {new}}}}. \end{aligned}$$

(26)

This allow us to get a relation between $b_j$ and $\varvec{\alpha _j}$ independent of $\mathcal{{L}}, \nu $ and $\varvec{\xi }$. If $\varvec{\hat{b}}$ and $\varvec{\hat{\alpha }}$ are the solutions then

$$\begin{aligned} \text {diag}(\varvec{\hat{\alpha }})\frac{\partial L}{\partial \varvec{\alpha }}= & {} \text {diag}(\varvec{\hat{\alpha }}) \frac{\partial \mathcal{{L}}}{\partial \varvec{\alpha }}\mid _{(\varvec{\hat{b},\hat{\alpha }})} + 2\lambda \text {diag}(\varvec{\hat{\alpha }})\varvec{\hat{\alpha }}\\ \text {diag}(\varvec{\hat{b}_{\text {new}}})\frac{\partial L}{\partial \varvec{b_{\text {new}}}}= & {} \text {diag}(\varvec{\hat{b}_{\text {new}}}) \frac{\partial \mathcal{{L}}}{\partial \varvec{b_{\text {new}}}}\mid _{(\varvec{\hat{b}_{\text {new}},\hat{\alpha }})} + 2\nu \text {diag}(\varvec{\hat{b}_{\text {new}}})\varvec{\hat{b}_{\text {new}}}-\text {diag}(\varvec{\hat{b}_{\text {new}}})\varvec{\xi }. \end{aligned}$$

But Lagrange multipliers for inactive constraints are 0, so $\text {diag}(\varvec{\hat{b}_{\text {new}}})\varvec{\xi }=0$. Since Eq. (26) is true at the optimum and $\frac{\partial L}{\partial \varvec{\alpha }}=\frac{\partial L}{\partial \varvec{b}_{\text {new}}}=0$ at $(\varvec{\hat{b}},\varvec{\hat{\alpha }})$, we have

$$\begin{aligned} (K+2)\hat{b}_j^2=\frac{\lambda }{\nu }\left\| \varvec{\hat{\alpha }_j}\right\| ^2,\quad j=1,\ldots ,p; \end{aligned}$$

and using the equality constraint in Eq. (23)

$$\begin{aligned} \hat{b}_j= & {} \frac{\sqrt{p}\left\| \varvec{\hat{\alpha }_j}\right\| ^2}{\sqrt{\sum _{j=1}^p\left\| \varvec{\hat{\alpha }_j}\right\| ^2}},\quad j=1,\ldots ,p. \end{aligned}$$

Finally, from $\varvec{\hat{\gamma }_j}=\text {diag}(\varvec{\hat{b}_{\text {new}}}) \varvec{\hat{\alpha }}$, we have

$$\begin{aligned} \left\| \varvec{\hat{\gamma }_j} \right\| =\frac{\sqrt{p}\left\| \varvec{\hat{\alpha }_j}\right\| ^2}{\sqrt{\sum _{j=1}^p\left\| \varvec{\hat{\alpha }_j}\right\| ^2}} \Leftrightarrow \hat{b}_j^2=\frac{p\left\| \varvec{\hat{\gamma }_j}\right\| }{\sum _{j=1}^p\left\| \varvec{\hat{\gamma }_j} \right\| }. \end{aligned}$$

Using the fact that $\frac{\partial L}{\partial \varvec{\hat{\gamma }_j}}\mid _{\varvec{\hat{\gamma }_j}}+2\lambda \frac{\varvec{\hat{\alpha }_j}}{\hat{b}_j}$, if $\hat{b}_j \ne 0$, one easily sees that the solution of Eq. (20) is a solution of the normal equations of

$$\begin{aligned} \min _{\varvec{\gamma _1}, \ldots , \varvec{\gamma _p}}&\left\| \mathbf {Y} - \sum _{j=1}^p \hat{H}_j \varvec{\gamma _j} \right\| _n^2 + \lambda \left( \sum _{j=1}^p \left\| \varvec{\gamma _j} \right\| \right) ^2. \end{aligned}$$

$\square $

1.4 Two step algorithm to solve Eq. (11)

Sketch of the proof

For fixed $\lambda _j$, $j=1,\ldots ,p$ the solution of problem (11) is given by the classical ridge in the context of additive regression, i.e.

$$\begin{aligned} \varvec{\beta _j}=\left( B_j^TB_j+\lambda _jM_j(\alpha )\right) ^{-1}\varvec{\beta _j}^T\left( \mathbf {Y}-\sum _{k\ne j}B_k \varvec{\beta _k}\right) . \end{aligned}$$

On the contrary, problem (11) is equivalent to

$$\begin{aligned} \min _{\begin{array}{l} \varvec{\tilde{\beta }_1}, \ldots , \varvec{\tilde{\beta }_p}\\ \lambda _1,\ldots ,\lambda _p \end{array}} \left\{ \left\| \mathbf {Y} - \sum _{j=1}^p B_j \varvec{\tilde{\beta }_j} \frac{1}{\lambda _j}\right\| _n^2 +\sum _{j=1}^p {\varvec{\tilde{\beta }_j}^T \tilde{M_j}(\alpha ) \varvec{\tilde{\beta }_j} }\frac{1}{\lambda _j^{2}} + \tau \sum _{j=1}^p \frac{\lambda }{\lambda _j}\right\} , \end{aligned}$$

(27)

with $\varvec{\tilde{\beta }_j}=\lambda _j\varvec{\beta _j}$ and $\tilde{M_j}(\alpha )=\lambda _jM_j(\alpha )$.$\square $

Now, supposing to know $\varvec{\tilde{\beta }_j}$ and $\tilde{M_j}(\alpha )$, Eq. (27) can be written in more compact way as

$$\begin{aligned} \min _{\lambda _1,\ldots ,\lambda _p} \left\{ \left\| \mathbf {Y} - D \varvec{\frac{1}{\lambda }}\right\| _n^2 + \left\| C\varvec{\frac{1}{\lambda }}\right\| ^2 + \tau \sum _{j=1}^p \frac{\lambda }{\lambda _j}\right\} , \end{aligned}$$

i.e.

$$\begin{aligned} \min _{\lambda _1,\ldots ,\lambda _p} \left\{ \left\| \left( {\begin{array}{l} \mathbf {Y}\\ \mathbf {0}\end{array}}\right) - \left( {\begin{array}{l} D\\ C\end{array}} \right) \varvec{\frac{1}{\lambda }}\right\| _n^2 + \tau \sum _{j=1}^p \frac{\lambda }{\lambda _j} \right\} , \end{aligned}$$

(28)

with $C=\mathrm {diag}(\varvec{c_1},\ldots ,\varvec{c_p})$, $\varvec{c_j}={\left( \tilde{M_j}(\alpha )\right) ^{\frac{1}{2}} \varvec{\tilde{\beta }_j}}, j=1,\ldots ,p$; and $D=\left[ \mathbf {d_1},\ldots ,\mathbf {d_p}\right] $, $\mathbf {d_j}=B_j\varvec{\tilde{\beta }_j}, j=1\ldots ,p$.

Assuming $\mathbf {z}=\left( {\begin{array}{l} \mathbf {Y}\\ \mathbf {0}\end{array}}\right) $ and $G=\left( {\begin{array}{l} D\\ C\end{array}} \right) $, Eq. (28) becomes

$$\begin{aligned} \min _{\lambda _1,\ldots ,\lambda _p} \left\{ \left\| \mathbf {z}-G \varvec{\frac{1}{\lambda }}\right\| _n^2 + \tau \sum _{j=1}^p \frac{\lambda }{\lambda _j} \right\} , \end{aligned}$$

(29)

that is equivalent to solve a Non Negative Garrote minimization.

So iterating between Eqs. (11) and (29) we obtain a double-step algorithm. Indeed starting from Eq. (11) for fixed $\tau $, when the parameters $\lambda _j$ are fixed constants, we obtain $\varvec{\tilde{\beta }_j}$ and $\tilde{M_j}(\alpha )$ by classical ridge regression in the context of additive regression. Substituting this values into Eq. (29) and minimizing with respect to the vector of the $\lambda _j$’s, the new values for $\lambda _j$’s are just obtained by the nonnegative garrote but using the previous ridge estimates of $\varvec{\beta }_j$ instead of the ordinary least squares estimates.

Appendix 2 : Algorithms and methods for selecting tuning parameters

To implement the proposed method, we need to choose the tuning parameters K, $\lambda $ and $\alpha $, and s in the non negative garrote. Both parameters $\alpha $ and K control the smoothness of the additive components, while $\lambda $ and s determine the sparsity. In Sect. 3 we have shown how these tuning parameters grow or decay at a proper rate to obtain variable selection and consistent estimators. In practice however we need data driven procedures for selecting this regularization parameters.

The parameter K was set to 7 basing on the dimensionality examples and after some experimentation (results were robust with respect to values around 7). This choice was done according to the rule $(K-2)\asymp {n}^{1/5}$ interior knots placed at the empirical quantiles of $X^{j}, j=1,\ldots ,p$, completed by two knots, lower and upper bound of the interval where the additive components are defined.

1.1 Choice of the optimal $\alpha $

Choice of the optimal smoothing parameter $\alpha $ in all the proposed methods is aimed at performing variable selection in the context of additive modeling. We have implemented the following heuristic algorithm

1.2 Choice of the regularization parameter $\lambda $ and s

gam estimates the regularization parameter $\lambda $ by GCV as

$$\begin{aligned} \text {GCV}(\lambda ) = \frac{\Vert \mathbf {Y} - \mathbf {Y}^{\text {pred}}(\lambda )\Vert ^2_n}{(1 - \text {df}/ n)^2} \end{aligned}$$

(see Avalos et al. 2007 for details). In all the other packages (grpreg, gglasso, sgl, cosso, hgam) the regularization parameter $\lambda $ has been selected by 5-fold Cross Validation. The same choice has been adopted for the testing procedures test-glr and test-nng to determine the parameter s controlling for sparsity in non negative garrote.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Amato, U., Antoniadis, A. & De Feis, I. Additive model selection. Stat Methods Appl 25, 519–564 (2016). https://doi.org/10.1007/s10260-016-0357-8

Download citation

Accepted: 02 March 2016
Published: 12 March 2016
Issue Date: November 2016
DOI: https://doi.org/10.1007/s10260-016-0357-8

Keywords

Mathematics Subject Classification

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Additive model selection

Abstract

Access this article

Similar content being viewed by others

Entropy-Based Subsampling Methods for Big Data

A Guide for Sparse PCA: Model Comparison and Applications

Residuals-based distributionally robust optimization with covariate information

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Electronic supplementary material

10260_2016_357_MOESM1_ESM.pdf

Appendices

Appendix 1: Some proofs

1.1 Equivalence between (4) and (6)

Sketch of the proof

1.2 Evaluation of the sparsity-smoothness penalty (7)

Sketch of the proof

Remark 5

1.3 Equivalence between Eqs. (8) and (11)

Proof

1.4 Two step algorithm to solve Eq. (11)

Sketch of the proof

Appendix 2 : Algorithms and methods for selecting tuning parameters

1.1 Choice of the optimal \(\alpha \)

1.2 Choice of the regularization parameter \(\lambda \) and s

Rights and permissions

About this article

Cite this article

Keywords

Mathematics Subject Classification

Navigation

Additive model selection

Abstract

Access this article

Similar content being viewed by others

Entropy-Based Subsampling Methods for Big Data

A Guide for Sparse PCA: Model Comparison and Applications

Residuals-based distributionally robust optimization with covariate information

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Electronic supplementary material

10260_2016_357_MOESM1_ESM.pdf

Appendices

Appendix 1: Some proofs

1.1 Equivalence between (4) and (6)

Sketch of the proof

1.2 Evaluation of the sparsity-smoothness penalty (7)

Sketch of the proof

Remark 5

1.3 Equivalence between Eqs. (8) and (11)

Proof

1.4 Two step algorithm to solve Eq. (11)

Sketch of the proof

Appendix 2 : Algorithms and methods for selecting tuning parameters

1.1 Choice of the optimal \(\alpha \)

1.2 Choice of the regularization parameter \(\lambda \) and s

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation