
Additive model selection


Abstract

We study sparse high-dimensional additive model fitting via penalization with sparsity-smoothness penalties. We review several existing algorithms that have been developed for this problem in the recent literature, highlighting the connections between them, and present some computationally efficient algorithms for fitting such models. Furthermore, under reasonable assumptions and exploiting recent results on group LASSO-like procedures, we take advantage of several oracle results which yield asymptotic optimality of estimators for high-dimensional but sparse additive models. Finally, variable selection procedures are compared with some high-dimensional testing procedures available in the literature for testing the presence of additive components.


References

  • Adams RA (1975) Sobolev spaces. Academic Press, New York
  • Amato U, Antoniadis A, Pensky M (2006) Wavelet kernel penalized estimation for nonequispaced design regression. Stat Comput 16:37–55
  • Antoniadis A, Fan J (2001) Regularization of wavelet approximations (with discussion). J Am Stat Assoc 96:939–967
  • Antoniadis A, Gijbels I, Verhasselt A (2012) Variable selection in additive models using P-splines. Technometrics 54:425–438
  • Avalos M, Grandvalet Y, Ambroise C (2007) Parsimonious additive models. Comput Stat Data Anal 51:2851–2870
  • Bach FR (2008) Bolasso: model consistent lasso estimation through the bootstrap. In: Proceedings of the 25th international conference on machine learning, Helsinki, Finland, pp 33–40
  • Bazerque JA, Mateos G, Giannakis GB (2011) Group-lasso on splines for spectrum cartography. IEEE Trans Signal Process 59(10):4648–4663
  • Belloni A, Chernozhukov V (2013) Least squares after model selection in high-dimensional sparse models. Bernoulli 19(2):521–547
  • Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 57:289–300
  • Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. Ann Stat 29:1165–1188
  • Bhatti MI, Bracken P (2007) Solutions of differential equations in a Bernstein polynomial basis. J Comput Appl Math 205:272–280
  • Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of lasso and Dantzig selector. Ann Stat 37(4):1705–1732
  • Breheny P, Huang J (2009) Penalized methods for bi-level variable selection. Stat Its Interface 2:369–380
  • Breiman L, Friedman JH (1985) Estimating optimal transformations for multiple regression and correlation. J Am Stat Assoc 80:580–598
  • Breiman L (1995) Better subset regression using the nonnegative garrote. Technometrics 37:373–384
  • Bühlmann P (2013) Statistical significance in high-dimensional linear models. Bernoulli 19:1212–1242
  • Bühlmann P, van de Geer S (2011) Statistics for high-dimensional data: methods, theory and applications. Springer, Heidelberg
  • Bunea F, Wegkamp M, Auguste A (2006) Consistent variable selection in high dimensional regression via multiple testing. J Stat Plan Inf 136:4349–4364
  • Bunea F, Tsybakov AB, Wegkamp MH (2007) Sparsity oracle inequalities for the Lasso. Electron J Stat 1:169–194 (electronic)
  • Cantoni E, Mills Flemming J, Ronchetti E (2011) Variable selection in additive models by nonnegative garrote. Stat Model 11:237–252
  • Casella G, Moreno E (2006) Objective Bayesian variable selection. J Am Stat Assoc 101:157–167
  • Cui X, Peng H, Wen S, Zhu L (2013) Component selection in the additive regression model. Scand J Stat 40:491–510
  • de Boor C (2001) A practical guide to splines. Applied Mathematical Sciences, vol 27. Springer, Berlin
  • Eilers PHC, Marx BD (1996) Flexible smoothing with B-splines and penalties. Stat Sci 11:89–121
  • Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
  • Fan J, Jiang J (2005) Nonparametric inference for additive models. J Am Stat Assoc 100:890–907
  • Fan J, Feng Y, Song R (2011) Nonparametric independence screening in sparse ultra-high-dimensional additive models. J Am Stat Assoc 106:544–557
  • Friedman J, Stuetzle W (1981) Projection pursuit regression. J Am Stat Assoc 76:817–823
  • Genuer R, Poggi JM, Tuleau-Malot C (2010) Variable selection using random forests. Pattern Recogn Lett 31(14):2225–2236
  • Hastie TJ, Tibshirani RJ (1990) Generalized additive models. Chapman & Hall, London
  • Horowitz J, Klemelä J, Mammen E (2006) Optimal estimation in additive regression models. Bernoulli 12:271–298
  • Huang J, Breheny P, Ma S (2012) A selective review of group selection in high-dimensional models. Stat Sci 27(4):481–499
  • Huang J, Horowitz JL, Wei F (2010) Variable selection in nonparametric additive models. Ann Stat 38:2282–2313
  • Javanmard A, Montanari A (2014) Hypothesis testing in high dimensional regression under the Gaussian random design model: asymptotic theory. IEEE Trans Inf Theory 60(10):6522–6554
  • Kim J, Kim Y, Kim Y (2008) A gradient-based optimization algorithm for LASSO. J Comput Graph Stat 17:994–1009
  • Koltchinskii V, Yuan M (2010) Sparsity in multiple kernel learning. Ann Stat 38:3660–3695
  • Lin Y, Zhang HH (2006) Component selection and smoothing in multivariate nonparametric regression. Ann Stat 34:2272–2297
  • Liu H, Zhang J (2008) On the \(\ell _1\)-\(\ell _q\) regularized regression. Technical report, Department of Statistics, Carnegie Mellon University. Preprint arXiv:0802.1517
  • Liu H, Zhang J (2009) Estimation consistency of the group lasso and its applications. J Mach Learn 5:376–383
  • Mammen E, Park BU (2006) A simple smooth backfitting method for additive models. Ann Stat 34:2252–2271
  • Marra G, Wood SN (2011) Practical variable selection for generalized additive models. Comput Stat Data Anal 55:2372–2387
  • Meier L, van de Geer S, Bühlmann P (2009) High-dimensional additive modeling. Ann Stat 37:3779–3821
  • Ravikumar P, Liu H, Lafferty J, Wasserman L (2009) SpAM: sparse additive models. J R Stat Soc Ser B 71:1009–1030
  • Simon N, Tibshirani R (2012) Standardization and the group lasso penalty. Stat Sin 22:983–1001
  • Simon N, Friedman J, Hastie T, Tibshirani R (2013) A sparse-group lasso. J Comput Graph Stat 22(2):231–245
  • Stamey T, Kabalin J, McNeal J, Johnston J, Freiha F, Redwine E, Yang N (1989) Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. J Urol 141:1076–1083
  • Storlie CB, Bondell HD, Reich BJ, Zhang HH (2011) Surface estimation, variable selection, and the nonparametric oracle property. Stat Sin 21:679–705
  • Suzuki T, Sugiyama M (2013) Fast learning rate of multiple kernel learning: trade-off between sparsity and smoothness. Ann Stat 41:1381–1405
  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288
  • Tutz G, Binder H (2006) Generalized additive modeling with implicit variable selection by likelihood based boosting. Biometrics 62:961–971
  • van de Geer SA (2008) High-dimensional generalized linear models and the lasso. Ann Stat 36(2):614–645
  • van de Geer S, Bühlmann P (2009) On the conditions used to prove oracle results for the Lasso. Electron J Stat 3:1360–1392
  • van de Geer S, Bühlmann P, Ritov Y, Dezeure R (2014) On asymptotically optimal confidence regions and tests for high-dimensional models. Ann Stat 42(3):1166–1202
  • Verhasselt A (2009) Variable selection with nonnegative garrote in additive models. In: Proceedings of EYSM 2009, Bucharest, Romania, pp 23–25
  • Wahba G (1990) Spline models for observational data. SIAM, Philadelphia
  • Wainwright M (2009) Sharp thresholds for high-dimensional and noisy sparsity recovery using \(l_1\)-constrained quadratic programming (Lasso). IEEE Trans Inf Theory 55:2183–2202
  • Wand MP, Ormerod JT (2008) On semiparametric regression with O’Sullivan penalised splines. Aust N Z J Stat 50:179–198
  • Wang L, Li H, Huang JZ (2008) Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. J Am Stat Assoc 103:1556–1569
  • Wiesenfarth M, Krivobokova T, Klasen S, Sperlich S (2012) Direct simultaneous inference in additive models and its application to model undernutrition. J Am Stat Assoc 107(500):1286–1296
  • Wu S, Xue H, Wu Y, Wu H (2014) Variable selection for sparse high-dimensional nonlinear regression models by combining nonnegative garrote and sure independence screening. Stat Sin 24(3):1365–1387
  • Wood SN (2006) Generalized additive models: an introduction with R. Chapman and Hall/CRC
  • Wood SN (2013) A simple test for random effects in regression models. Biometrika 100(4):1005–1010
  • Xue L (2009) Consistent variable selection in additive models. Stat Sin 19:1281–1296
  • Yang L (2008) Confidence band for additive regression model. J Data Sci 6:207–217
  • Yang Y, Zou H (2015) A fast unified algorithm for computing group lasso penalized learning problems. Stat Comput 25(6):1129–1141
  • Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B 68:49–67
  • Yuan M (2007) Nonnegative garrote component selection in functional ANOVA models. In: Proceedings of the eleventh international conference on artificial intelligence and statistics, March 21–24, 2007, San Juan, Puerto Rico, pp 660–666
  • Yuan M, Lin Y (2007) On the nonnegative garrote estimator. J R Stat Soc Ser B 69:143–161
  • Zhang CH, Huang J (2008) The sparsity and bias of the lasso selection in high-dimensional linear regression. Ann Stat 36(4):1567–1594
  • Zhang CH, Zhang SS (2014) Confidence intervals for low dimensional parameters in high dimensional linear models. J R Stat Soc Ser B 76:217–241
  • Zhao P, Rocha G, Yu B (2009) The composite absolute penalties family for grouped and hierarchical variable selection. Ann Stat 37:3468–3497
  • Zheng S (2008) Selection of components and degrees of smoothing via lasso in high dimensional nonparametric additive models. Comput Stat Data Anal 53:164–175


Acknowledgments

We wish to thank three anonymous reviewers, the Associate Editor and the Editor for their helpful comments and for pointing out to us some additional references. In particular, we are extremely grateful to the reviewers for providing valuable and detailed suggestions which have led to substantial improvements in the paper.

Author information

Corresponding author

Correspondence to Italia De Feis.

Electronic supplementary material

Below is the link to the electronic supplementary material.

10260_2016_357_MOESM1_ESM.pdf

Supplement to the paper “Additive model selection” (additive-supp). Detailed results and discussion on all simulation runs for all four examples. (PDF 196 kB)

Appendices

Appendix 1: Some proofs

1.1 Equivalence between (4) and (6)

Sketch of the proof

The Lagrangian corresponding to (6) is

$$\begin{aligned} L(f_1,\ldots ,f_p,\lambda _1,\ldots ,\lambda _p)= & {} \left\| \mathbf {Y} - \sum _{j=1}^p \mathbf {f}_j\right\| _n^2 + \sum _{j=1}^p \lambda _j J^2_\alpha (f_j)+ \nu \left( \sum _{j=1}^p \frac{1}{\lambda _j}-\frac{p}{\lambda }\right) \\&+\sum _{j=1}^p \eta _j \lambda _j, \end{aligned}$$

where \(\nu \) and \(\eta =(\eta _1,\ldots ,\eta _p)\) are the Lagrange multipliers corresponding to the equality and inequality constraints, respectively, and problem (6) becomes

$$\begin{aligned} \min _{f_j\in {\mathcal {H}}_j,\lambda _j} L(f_1,\ldots ,f_p,\lambda _1,\ldots ,\lambda _p). \end{aligned}$$

Let us consider the derivative of the Lagrangian with respect to \(\lambda _1,\ldots ,\lambda _p\).

$$\begin{aligned} \frac{\partial {L(f_1,\ldots ,f_p,\lambda _1,\ldots ,\lambda _p)}}{\partial {\lambda _k}}=0 \Leftrightarrow \lambda _k=\frac{\sqrt{\nu }}{J_{\alpha }(f_k)}, \end{aligned}$$

because \(\eta _k=0\) by the complementary slackness condition for the inactive constraints.

Using the equality constraint in (6) we have

$$\begin{aligned} \sqrt{\nu }= & {} \frac{\lambda }{p}\sum _{j=1}^pJ_{\alpha }(f_j) \Rightarrow \lambda _k=\frac{1}{J_{\alpha }(f_k)}\frac{\lambda }{p}\sum _{j=1}^p J_{\alpha }(f_j) \Rightarrow \sum _{j=1}^p \lambda _j J^2_{\alpha }(f_j)\\= & {} \frac{\lambda }{p}\left( \sum _{j=1}^pJ_{\alpha }(f_j)\right) ^2. \end{aligned}$$

It follows that (6) can be rewritten as

$$\begin{aligned} \min _{{f_j\in {\mathcal {H}}_j} } \left\{ \left\| \mathbf {Y} - \sum _{j=1}^p \mathbf {f}_j\right\| _n^2 + \frac{\lambda }{p} \left( \sum _{j=1}^p J_\alpha (f_j)\right) ^2\right\} . \end{aligned}$$ \(\square \)
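
As a quick numerical sanity check of this identity (not part of the original argument), the short R snippet below compares the closed-form value \(\frac{\lambda }{p}\left( \sum _{j=1}^p J_\alpha (f_j)\right) ^2\) with the penalty \(\sum _j \lambda _j J^2_\alpha (f_j)\) evaluated at the claimed minimizer and at randomly generated feasible choices of the \(\lambda _j\)'s; the values of \(J_\alpha (f_j)\), p and \(\lambda \) are arbitrary illustrative choices.

```r
set.seed(1)
p      <- 5
J      <- runif(p, 0.1, 2)     # J_alpha(f_j) > 0 (illustrative values)
lambda <- 0.7

closed_form <- (lambda / p) * sum(J)^2

## penalty at the claimed minimizer lambda_j = (lambda / p) * sum(J) / J_j
lam_opt <- (lambda / p) * sum(J) / J
stopifnot(abs(sum(1 / lam_opt) - p / lambda) < 1e-12)   # feasibility of lam_opt
sum(lam_opt * J^2) - closed_form                        # ~ 0

## random feasible lambda's never do better (Cauchy-Schwarz)
for (r in 1:1000) {
  w   <- runif(p)                        # arbitrary positive weights
  lam <- (lambda / p) * sum(w) / w       # rescaled so that sum(1 / lam) = p / lambda
  stopifnot(sum(lam * J^2) >= closed_form - 1e-10)
}
```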

1.2 Evaluation of the sparsity-smoothness penalty (7)

Sketch of the proof

First of all observe that

$$\begin{aligned}&\int _u^v(a-x)^m(b-x)^m dx\nonumber \\&=\sum _{k=0}^m\sum _{j=0}^m \left( \begin{array}{c} m \\ k \end{array}\right) \left( \begin{array}{c} m \\ j \end{array}\right) a^kb^j(-1)^{k+j}\frac{(v^{2m-k-j+1}-u^{2m-j-k+1})}{(2m-k-j+1)}. \end{aligned}$$
(16)
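
Formula (16) follows from expanding both binomials and integrating term by term; the short R check below (with arbitrary illustrative values of a, b, u, v and m) compares it with adaptive quadrature.

```r
## closed form of Eq. (16) versus numerical integration
eq16 <- function(a, b, u, v, m) {
  val <- 0
  for (k in 0:m) for (j in 0:m) {
    e   <- 2 * m - k - j + 1
    val <- val + choose(m, k) * choose(m, j) * a^k * b^j * (-1)^(k + j) * (v^e - u^e) / e
  }
  val
}

a <- 0.8; b <- 1.3; u <- 0.1; v <- 0.6; m <- 3
eq16(a, b, u, v, m)
integrate(function(x) (a - x)^m * (b - x)^m, u, v)$value   # should agree
```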

The B-spline of degree m on the knots \(t_1,\ldots ,t_{K-2}\) is

$$\begin{aligned} B_{k,m}(x)=(t_{k+m+1}-t_k)\left[ t_k,\ldots ,t_{k+m+1}\right] (t-x)_+^m=\sum _{r=0}^{m+1}\alpha _{r,m+1}^{(k)}(t_{k+r}-x)_+^m \end{aligned}$$

for \(k=-m,\ldots ,K-2\), where the \(\alpha _{r,m+1}^{(k)}\) are the coefficients of the divided differences of degree m in \(t_k\).

Let us consider the matrices \(K^{(m)}=\left( k^{(m)}_{il}\right) \), \(i,l=-m,\ldots ,K-2\) and \(\varOmega ^{(m)}=\left( \omega ^{(m)}_{il}\right) \), \(i,l=-m+j,\ldots ,K-2\) defined by

$$\begin{aligned} k^{(m)}_{il}=\int _0^1 B_{i,m}(x)B_{l,m}(x)dx \\ \omega ^{(m)}_{il}=\int _0^1 B_{i,m}^{(j)}(x)B_{l,m}^{(j)}(x)dx. \end{aligned}$$

Let us evaluate \(K^{(m)}\) first. Note that \(K^{(m)}\) is a symmetric band matrix of bandwidth \(2m+1\) and dimension \((K-1+m) \times (K-1+m)\). It is therefore sufficient to evaluate the principal diagonal and the upper diagonals, i.e. the elements \(k^{(m)}_{ii+j}\), \(i=-m,\ldots ,K-2\); \(j=0,\ldots ,m\):

$$\begin{aligned} k^{(m)}_{ii+j}=\int _0^1 B_{i,m}(x)B_{i+j,m}(x)dx. \end{aligned}$$

\(B_{i,m}(x)\) is nonzero on \((t_i,t_{i+m+1})\) and \(B_{i+j,m}(x)\) is nonzero on \((t_{i+j},t_{i+j+m+1})\), so, setting \(u=\max (0,t_{i+j})\) and \(v=\min (t_{i+m+1},1)\), it follows that

$$\begin{aligned} k^{(m)}_{ii+j}= & {} \int _u^v B_{i,m}(x)B_{i+j,m}(x)dx=\sum _{r=0}^{m+1}\sum _{s=0}^{m+1}\alpha _{r,m+1}^{(i)}\alpha _{s,m+1}^{(i+j)}I(u,v,t_{i+r},t_{i+j+s}) \end{aligned}$$

where

$$\begin{aligned} I(u,v,t_{i+r},t_{i+j+s})=\int _u^v (t_{i+r}-x)_{+}^{m}(t_{i+j+s}-x)_{+}^{m}dx. \end{aligned}$$

But

$$\begin{aligned} I(u,v,t_{i+r},t_{i+j+s})=\left\{ \begin{array}{ll} 0 &{}\quad \text {if} \ \min (t_{i+r},t_{i+j+s})\le u \\ \int _u^{\min (v,z)}(t_{i+r}-x)^m(t_{i+j+s}-x)^m dx &{} \quad \text {otherwise} \end{array} \right. \end{aligned}$$

where \(z=\min (t_{i+r},t_{i+j+s})\) and the integral can be evaluated using Eq. (16).

Now let us evaluate the matrix \(\varOmega ^{(m)}\). First of all let s(x) be the spline of degree m on the knots \(t_1,\ldots ,t_{K-2}\)

$$\begin{aligned} s(x)=\sum _{i=-m}^{K-2}\beta _iB_{i,m}(x). \end{aligned}$$

Then for each \(j=1,\ldots ,m-1\)

$$\begin{aligned} s^{(j)}(x)=\sum _{i=-m+j}^{K-2}\beta _i^{(j)}B_{i,m-j}(x) \end{aligned}$$

with

$$\begin{aligned} \beta _i^{(j)}=\left\{ \begin{array}{ll} \beta _i &{} \quad \mathrm{if} \quad j=0 \\ (m+1-j)\frac{\beta _i^{(j-1)}-\beta _{i-1}^{(j-1)}}{t_{i+m+1-j}-t_i} &{}\quad \mathrm{if} \quad j>0.\end{array} \right. \end{aligned}$$
(17)

This implies that

$$\begin{aligned} \int _0^1\left( s^{(j)}(x)\right) ^2dx= & {} \sum _{i=-m+j}^{K-2}\sum _{l=-m+j}^{K-2}\beta _i^{(j)}\beta _l^{(j)}\int _0^1 B_{i,m-j}(x)B_{l,m-j}(x)dx\nonumber \\= & {} \left( \beta ^{(j)}\right) ^tK^{(m-j)}\beta ^{(j)}\nonumber \\ \beta ^{(j)}= & {} \left( \beta _{-m+j}^{(j)},\ldots ,\beta _{K-2}^{(j)}\right) ^t. \end{aligned}$$
(18)

Using Eq. (17) we have

$$\begin{aligned} \beta ^{(j)}=T^{(j)}\beta , \end{aligned}$$
(19)

and \(T^{(j)}\) is obtained iteratively as follows:

$$\begin{aligned} T^{(0)}= & {} I_{m+K-1 \times m+K-1}\\ T^{(s)}= & {} (m+1-s)V^{(s)}T^{(s-1)}, \quad s=1,\ldots ,j; \end{aligned}$$

where \(V^{(s)}\) is a bi-diagonal matrix of dimension \((m+K-1-s) \times (m+K-s)\) whose principal diagonal is \(d=\left( \ldots ,-(t_{i+m+1-s}-t_i)^{-1},\ldots \right) ^t\), \(i=-m+s,\ldots ,K-2\), and whose upper diagonal is \(d^u=-d\).

Using Eq. (19), Eq. (18) can be rewritten as

$$\begin{aligned} \int _0^1\left( s^{(j)}(x)\right) ^2dx= & {} \beta ^t\left( T^{(j)}\right) ^tK^{(m-j)}T^{(j)}\beta =\beta ^t\varOmega ^{(m)}\beta \\ \varOmega ^{(m)}= & {} \left( T^{(j)}\right) ^tK^{(m-j)}T^{(j)}. \end{aligned}$$ \(\square \)
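
The derivative recursion (17) underlying this construction can be checked numerically. The R sketch below does so for \(j=1\) and cubic splines (\(m=3\)), using the base splineDesign function of the splines package; the knot sequence, the coefficients and the evaluation grid are arbitrary illustrative choices, and the indexing refers to a full (extended) knot vector stored as an ordinary R vector.

```r
## check of the derivative recursion (17) for j = 1 (illustrative knots/coefficients)
library(splines)
set.seed(1)

m     <- 3
ord   <- m + 1
knots <- seq(-0.3, 1.3, by = 0.1)            # extended knot sequence around [0, 1]
nbas  <- length(knots) - ord                 # number of B-splines of degree m
beta  <- rnorm(nbas)
x     <- seq(0.01, 0.99, length.out = 50)

## first derivative evaluated directly
s1_direct <- splineDesign(knots, x, ord = ord, derivs = rep(1, length(x))) %*% beta

## first derivative via (17): a spline of degree m - 1 with differenced, rescaled coefficients
i      <- 2:nbas
dcoef  <- (ord - 1) * (beta[i] - beta[i - 1]) / (knots[i + ord - 1] - knots[i])
B_der  <- splineDesign(knots[2:(length(knots) - 1)], x, ord = ord - 1)
s1_rec <- B_der %*% dcoef

max(abs(s1_direct - s1_rec))                 # ~ machine precision
```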

Remark 5

Note that the R package fda (http://cran.r-project.org/web/packages/fda) has built-in functions to evaluate both the B-spline functional data basis and the penalty matrices (create.bspline.basis and bsplinepen). For the computation of the B-spline basis the algorithm automatically adds \(m+1\) replicates of the end points to the interior knots; the penalty matrices are calculated using the polynomial form of the B-splines.

We used our own routines (available upon request) to evaluate both the B-spline basis and the penalty matrices. For the B-spline basis we preferred to add \(m+1\) equispaced knots to the left and to the right of the interval defined by the interior knots, following Eilers and Marx (1996); for the penalties we implemented directly the formulas above, which evaluate the integrals exactly.
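
For readers who prefer the fda route mentioned in Remark 5, the minimal sketch below (with an illustrative set of equispaced interior knots) obtains the Gram matrix \(K^{(m)}\) and a derivative penalty matrix directly from create.bspline.basis and bsplinepen; because fda replicates the boundary knots, these matrices differ from the ones produced by the equispaced-extension construction described above.

```r
## sketch of the fda route of Remark 5 (illustrative knots; requires the fda package)
library(fda)

m        <- 3                                   # cubic B-splines, order m + 1 = 4
interior <- seq(0.1, 0.9, by = 0.1)             # K - 2 interior knots in (0, 1)
basis    <- create.bspline.basis(rangeval = c(0, 1), norder = m + 1,
                                 breaks = c(0, interior, 1))

K0 <- bsplinepen(basis, Lfdobj = 0)             # Gram matrix: integrals of B_i B_l
O2 <- bsplinepen(basis, Lfdobj = 2)             # penalty for j = 2: integrals of B_i'' B_l''
dim(K0); isSymmetric(K0)
```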

1.3 Equivalence between Eqs. (8) and (11)

Proof

Eq. (8) is equivalent to Eq. (12) after reparametrization. We now show that Eq. (12) is equivalent to solving the following adaptive ridge optimization problem

$$\begin{aligned} \min _{\varvec{\gamma _1}, \ldots , \varvec{\gamma _p}}&\left\| \mathbf {Y} - \sum _{j=1}^p \hat{H}_j \varvec{\gamma _j} \right\| _n^2 + \sum _{j=1}^p \lambda _j \left\| \varvec{\gamma _j} \right\| ^2 \end{aligned}$$
(20)
$$\begin{aligned} \text {subject to:}&\sum _{j=1}^p \frac{1}{\lambda _j} = \frac{p}{\lambda }, \end{aligned}$$
(21)
$$\begin{aligned}&\lambda _j > 0, \quad j=1, \ldots , p. \end{aligned}$$
(22)

Let us put \(\varvec{\alpha _j}=\sqrt{\frac{\lambda _j}{\lambda }}\varvec{\gamma _j}\), \(b_j=\sqrt{\frac{\lambda }{\lambda _j}}\), \(\varvec{\alpha }=\left( \varvec{\alpha _1}^t,\ldots ,\varvec{\alpha _p}^t\right) ^t\) and \(\varvec{b}=\left( b_1,\ldots ,b_p\right) ^t\), then Eq. (20) becomes

$$\begin{aligned} \min _{\varvec{b},\varvec{\alpha }}&\mathcal{{L}} \left( \varvec{b,\alpha }\right) + \lambda \sum _{j=1}^p \left\| \varvec{\alpha _j} \right\| ^2 \end{aligned}$$
(23)
$$\begin{aligned} \text {subject to:}&\sum _{j=1}^p b_j^2 = p, \end{aligned}$$
(24)
$$\begin{aligned}&b_j \ge 0, \quad j=1, \ldots , p, \end{aligned}$$
(25)

with \(\mathcal{{L}} \left( \varvec{b,\alpha }\right) =\left\| \mathbf {Y} - \sum _{j=1}^p \hat{H}_j \sqrt{\frac{\lambda }{\lambda _j}}\varvec{\alpha _j} \right\| _n^2\). The Lagrangian associated with Eq. (23) is

$$\begin{aligned} L\left( \varvec{b,\alpha }\right) =\mathcal{{L}} \left( \varvec{b,\alpha }\right) + \lambda \sum _{j=1}^p \left\| \varvec{\alpha _j} \right\| ^2+\nu \left( \sum _{j=1}^p b_j^2-p\right) -\varvec{\xi ^t b} \end{aligned}$$

where \(\nu \) and \(\varvec{\xi }=(\xi _1,\ldots ,\xi _p)^t\) are the corresponding Lagrange multipliers. The normal equations are therefore:

$$\begin{aligned} \frac{\partial L\left( \varvec{b,\alpha }\right) }{\partial \varvec{\alpha _j}}= & {} \frac{\partial \mathcal{{L}}\left( \varvec{b,\alpha }\right) }{\partial \varvec{\alpha _j}}+2\lambda \varvec{\alpha _j}, \quad j=1,\ldots ,p \\ \frac{\partial L\left( \varvec{b,\alpha }\right) }{\partial \varvec{b}}= & {} \frac{\partial \mathcal{{L}}\left( \varvec{b,\alpha }\right) }{\partial \varvec{b}}+2\nu \varvec{b}-\varvec{\xi }. \end{aligned}$$

Now, for every j, one has \(\varvec{\gamma _j}=b_j\varvec{\alpha _j}\), i.e. \(\varvec{\gamma }=\text {diag}(\varvec{b_{\text {new}}})\varvec{\alpha }\), with \(\varvec{b_{\text {new}}}=\left( \varvec{b_1}^t,\ldots ,\varvec{b_p}^t\right) ^t\), and \(\varvec{b_j}=\left( b_j,\ldots ,b_j\right) ^t \in R^{K+2}\). Hence

$$\begin{aligned} \frac{\partial \mathcal{{L}}}{\partial \varvec{\alpha }}= & {} \text {diag}(\varvec{b_{\text {new}}})\frac{\partial \mathcal{{L}}}{\partial \varvec{\gamma }}\\ \frac{\partial \mathcal{{L}}}{\partial \varvec{b_{\text {new}}}}= & {} \text {diag}(\varvec{ \alpha })\frac{\partial \mathcal{{L}}}{\partial \varvec{\gamma }}. \end{aligned}$$

It follows

$$\begin{aligned} \text {diag}(\varvec{\alpha })\frac{\partial \mathcal{{L}}}{\partial \varvec{\alpha }}=\text {diag}(\varvec{b_{\text {new}}})\frac{\partial \mathcal{{L}}}{\partial \varvec{b_{\text {new}}}}. \end{aligned}$$
(26)

This allows us to obtain a relation between \(b_j\) and \(\varvec{\alpha _j}\) independent of \(\mathcal{{L}}, \nu \) and \(\varvec{\xi }\). If \(\varvec{\hat{b}}\) and \(\varvec{\hat{\alpha }}\) are the solutions, then

$$\begin{aligned} \text {diag}(\varvec{\hat{\alpha }})\frac{\partial L}{\partial \varvec{\alpha }}= & {} \text {diag}(\varvec{\hat{\alpha }}) \frac{\partial \mathcal{{L}}}{\partial \varvec{\alpha }}\mid _{(\varvec{\hat{b},\hat{\alpha }})} + 2\lambda \text {diag}(\varvec{\hat{\alpha }})\varvec{\hat{\alpha }}\\ \text {diag}(\varvec{\hat{b}_{\text {new}}})\frac{\partial L}{\partial \varvec{b_{\text {new}}}}= & {} \text {diag}(\varvec{\hat{b}_{\text {new}}}) \frac{\partial \mathcal{{L}}}{\partial \varvec{b_{\text {new}}}}\mid _{(\varvec{\hat{b}_{\text {new}},\hat{\alpha }})} + 2\nu \text {diag}(\varvec{\hat{b}_{\text {new}}})\varvec{\hat{b}_{\text {new}}}-\text {diag}(\varvec{\hat{b}_{\text {new}}})\varvec{\xi }. \end{aligned}$$

But Lagrange multipliers for inactive constraints are 0, so \(\text {diag}(\varvec{\hat{b}_{\text {new}}})\varvec{\xi }=0\). Since Eq. (26) is true at the optimum and \(\frac{\partial L}{\partial \varvec{\alpha }}=\frac{\partial L}{\partial \varvec{b}_{\text {new}}}=0\) at \((\varvec{\hat{b}},\varvec{\hat{\alpha }})\), we have

$$\begin{aligned} (K+2)\hat{b}_j^2=\frac{\lambda }{\nu }\left\| \varvec{\hat{\alpha }_j}\right\| ^2,\quad j=1,\ldots ,p; \end{aligned}$$

and using the equality constraint in Eq. (23)

$$\begin{aligned} \hat{b}_j= & {} \frac{\sqrt{p}\left\| \varvec{\hat{\alpha }_j}\right\| }{\sqrt{\sum _{k=1}^p\left\| \varvec{\hat{\alpha }_k}\right\| ^2}},\quad j=1,\ldots ,p. \end{aligned}$$

Finally, from \(\varvec{\hat{\gamma }}=\text {diag}(\varvec{\hat{b}_{\text {new}}}) \varvec{\hat{\alpha }}\), i.e. \(\varvec{\hat{\gamma }_j}=\hat{b}_j\varvec{\hat{\alpha }_j}\), we have

$$\begin{aligned} \left\| \varvec{\hat{\gamma }_j} \right\| =\frac{\sqrt{p}\left\| \varvec{\hat{\alpha }_j}\right\| ^2}{\sqrt{\sum _{k=1}^p\left\| \varvec{\hat{\alpha }_k}\right\| ^2}} \Leftrightarrow \hat{b}_j^2=\frac{p\left\| \varvec{\hat{\gamma }_j}\right\| }{\sum _{k=1}^p\left\| \varvec{\hat{\gamma }_k} \right\| }. \end{aligned}$$

Using the fact that \(\frac{\partial \mathcal{{L}}}{\partial \varvec{\gamma _j}}\mid _{\varvec{\hat{\gamma }_j}}+2\lambda \frac{\varvec{\hat{\alpha }_j}}{\hat{b}_j}=0\) whenever \(\hat{b}_j \ne 0\), one easily sees that the solution of Eq. (20) is a solution of the normal equations of

$$\begin{aligned} \min _{\varvec{\gamma _1}, \ldots , \varvec{\gamma _p}}&\left\| \mathbf {Y} - \sum _{j=1}^p \hat{H}_j \varvec{\gamma _j} \right\| _n^2 + \lambda \left( \sum _{j=1}^p \left\| \varvec{\gamma _j} \right\| \right) ^2. \end{aligned}$$

\(\square \)

1.4 Two step algorithm to solve Eq. (11)

Sketch of the proof

For fixed \(\lambda _j\), \(j=1,\ldots ,p\), the solution of problem (11) is given by the classical ridge update in the context of additive regression, i.e.

$$\begin{aligned} \varvec{\beta _j}=\left( B_j^TB_j+\lambda _jM_j(\alpha )\right) ^{-1}B_j^T\left( \mathbf {Y}-\sum _{k\ne j}B_k \varvec{\beta _k}\right) . \end{aligned}$$

On the other hand, problem (11) is equivalent to

$$\begin{aligned} \min _{\begin{array}{l} \varvec{\tilde{\beta }_1}, \ldots , \varvec{\tilde{\beta }_p}\\ \lambda _1,\ldots ,\lambda _p \end{array}} \left\{ \left\| \mathbf {Y} - \sum _{j=1}^p B_j \varvec{\tilde{\beta }_j} \frac{1}{\lambda _j}\right\| _n^2 +\sum _{j=1}^p {\varvec{\tilde{\beta }_j}^T \tilde{M_j}(\alpha ) \varvec{\tilde{\beta }_j} }\frac{1}{\lambda _j^{2}} + \tau \sum _{j=1}^p \frac{\lambda }{\lambda _j}\right\} , \end{aligned}$$
(27)

with \(\varvec{\tilde{\beta }_j}=\lambda _j\varvec{\beta _j}\) and \(\tilde{M_j}(\alpha )=\lambda _jM_j(\alpha )\).

Now, supposing \(\varvec{\tilde{\beta }_j}\) and \(\tilde{M_j}(\alpha )\) to be known, Eq. (27) can be written more compactly as

$$\begin{aligned} \min _{\lambda _1,\ldots ,\lambda _p} \left\{ \left\| \mathbf {Y} - D \varvec{\frac{1}{\lambda }}\right\| _n^2 + \left\| C\varvec{\frac{1}{\lambda }}\right\| ^2 + \tau \sum _{j=1}^p \frac{\lambda }{\lambda _j}\right\} , \end{aligned}$$

i.e.

$$\begin{aligned} \min _{\lambda _1,\ldots ,\lambda _p} \left\{ \left\| \left( {\begin{array}{l} \mathbf {Y}\\ \mathbf {0}\end{array}}\right) - \left( {\begin{array}{l} D\\ C\end{array}} \right) \varvec{\frac{1}{\lambda }}\right\| _n^2 + \tau \sum _{j=1}^p \frac{\lambda }{\lambda _j} \right\} , \end{aligned}$$
(28)

with \(C=\mathrm {diag}(\varvec{c_1},\ldots ,\varvec{c_p})\), \(\varvec{c_j}={\left( \tilde{M_j}(\alpha )\right) ^{\frac{1}{2}} \varvec{\tilde{\beta }_j}}, j=1,\ldots ,p\); and \(D=\left[ \mathbf {d_1},\ldots ,\mathbf {d_p}\right] \), \(\mathbf {d_j}=B_j\varvec{\tilde{\beta }_j}, j=1,\ldots ,p\).

Setting \(\mathbf {z}=\left( {\begin{array}{l} \mathbf {Y}\\ \mathbf {0}\end{array}}\right) \) and \(G=\left( {\begin{array}{l} D\\ C\end{array}} \right) \), Eq. (28) becomes

$$\begin{aligned} \min _{\lambda _1,\ldots ,\lambda _p} \left\{ \left\| \mathbf {z}-G \varvec{\frac{1}{\lambda }}\right\| _n^2 + \tau \sum _{j=1}^p \frac{\lambda }{\lambda _j} \right\} , \end{aligned}$$
(29)

which is equivalent to solving a nonnegative garrote minimization problem.

Iterating between Eqs. (11) and (29) we thus obtain a two-step algorithm. Indeed, starting from Eq. (11) for fixed \(\tau \), with the parameters \(\lambda _j\) held at fixed constants, we obtain \(\varvec{\tilde{\beta }_j}\) and \(\tilde{M_j}(\alpha )\) by classical ridge regression in the context of additive regression. Substituting these values into Eq. (29) and minimizing with respect to the vector of the \(\lambda _j\)'s, the new values of the \(\lambda _j\)'s are obtained by the nonnegative garrote, but using the previous ridge estimates of \(\varvec{\beta }_j\) instead of the ordinary least squares estimates.
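
The following schematic R sketch illustrates this two-step iteration; it is not the authors' implementation, and all function and variable names are illustrative. It assumes that all components share a common basis dimension, solves the ridge step jointly in closed form (which at convergence agrees with iterating the componentwise update displayed above), and solves the nonnegative-garrote step (29) in the variables \(u_j=1/\lambda _j\) as a nonnegative penalized least squares problem via the quadprog package.

```r
## schematic two-step sketch (NOT the authors' code); Bj: list of n x q design
## matrices B_j, Mj: list of q x q penalty matrices M_j(alpha); lambda0 and tau
## play the roles of lambda and tau in Eq. (27); ||.||_n^2 is taken as mean of squares
library(quadprog)

two_step <- function(Y, Bj, Mj, lambda0, tau, n_iter = 20, eps = 1e-8) {
  n <- length(Y); p <- length(Bj); q <- ncol(Bj[[1]])
  lam  <- rep(lambda0, p)                              # initial lambda_j
  beta <- NULL
  for (it in seq_len(n_iter)) {
    ## Step 1: ridge for fixed lambda_j (joint closed form)
    B    <- do.call(cbind, Bj)
    P    <- as.matrix(Matrix::bdiag(lapply(1:p, function(j) lam[j] * Mj[[j]])))
    coef <- solve(crossprod(B) / n + P, crossprod(B, Y) / n)
    beta <- split(as.numeric(coef), rep(1:p, each = q))

    ## Step 2: nonnegative-garrote step (29) in u_j = 1/lambda_j
    tb   <- lapply(1:p, function(j) lam[j] * beta[[j]])      # beta-tilde_j
    D    <- sapply(1:p, function(j) Bj[[j]] %*% tb[[j]])     # columns d_j = B_j beta-tilde_j
    Cmat <- matrix(0, q * p, p)                              # block-diagonal C with columns c_j
    for (j in 1:p) {
      e     <- eigen(lam[j] * Mj[[j]], symmetric = TRUE)     # M-tilde_j(alpha)
      Msqrt <- e$vectors %*% (sqrt(pmax(e$values, 0)) * t(e$vectors))
      Cmat[((j - 1) * q + 1):(j * q), j] <- Msqrt %*% tb[[j]]
    }
    z    <- c(Y, rep(0, q * p))
    G    <- rbind(D, Cmat)
    Dmat <- 2 * (crossprod(G) / n + 1e-8 * diag(p))          # small jitter keeps solve.QP well posed
    dvec <- as.numeric(2 * crossprod(G, z) / n - tau * lambda0)
    u    <- solve.QP(Dmat, dvec, Amat = diag(p), bvec = rep(0, p))$solution
    lam  <- 1 / pmax(u, eps)      # u_j close to 0: component effectively dropped
  }
  list(beta = beta, lambda = lam)
}
```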

Appendix 2: Algorithms and methods for selecting tuning parameters

To implement the proposed method, we need to choose the tuning parameters K, \(\lambda \) and \(\alpha \), and s in the nonnegative garrote. The parameters \(\alpha \) and K control the smoothness of the additive components, while \(\lambda \) and s determine the sparsity. In Sect. 3 we have shown at which rates these tuning parameters must grow or decay to obtain variable selection and consistent estimators. In practice, however, we need data-driven procedures for selecting these regularization parameters.

The parameter K was set to 7, based on the dimensionality of the examples and after some experimentation (results were robust to values around 7). This choice follows the rule of \((K-2)\asymp {n}^{1/5}\) interior knots placed at the empirical quantiles of \(X^{j}\), \(j=1,\ldots ,p\), completed by two boundary knots at the lower and upper bounds of the interval on which the additive components are defined.
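
A minimal R sketch of this knot-placement rule is given below; the equally spaced quantile levels used for the interior knots are our own illustrative choice.

```r
## K - 2 interior knots at empirical quantiles of x, plus the two boundary knots
make_knots <- function(x, K = 7) {
  probs    <- seq(0, 1, length.out = K)[-c(1, K)]   # K - 2 equally spaced quantile levels
  interior <- as.numeric(quantile(x, probs))
  c(min(x), interior, max(x))                       # lower bound, interior knots, upper bound
}

set.seed(1)
x <- runif(200)
make_knots(x, K = 7)    # 7 knots in total: 5 interior + 2 boundary
```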

1.1 Choice of the optimal \(\alpha \)

In all the proposed methods, the optimal smoothing parameter \(\alpha \) is chosen with the aim of performing variable selection in the context of additive modeling. We have implemented the following heuristic algorithm:

[Heuristic algorithm for the selection of \(\alpha \), displayed as a figure in the published article.]

1.2 Choice of the regularization parameter \(\lambda \) and s

The gam package estimates the regularization parameter \(\lambda \) by generalized cross-validation (GCV) as

$$\begin{aligned} \text {GCV}(\lambda ) = \frac{\Vert \mathbf {Y} - \mathbf {Y}^{\text {pred}}(\lambda )\Vert ^2_n}{(1 - \text {df}/ n)^2} \end{aligned}$$

(see Avalos et al. 2007 for details). In all the other packages (grpreg, gglasso, sgl, cosso, hgam) the regularization parameter \(\lambda \) has been selected by 5-fold cross-validation. The same choice has been adopted for the testing procedures test-glr and test-nng to determine the parameter s controlling the sparsity in the nonnegative garrote.
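
For concreteness, the minimal R sketches below show the GCV criterion above and a generic 5-fold cross-validation loop; fit_fun, its predict method and the candidate grid are placeholders and do not correspond to the interface of any of the packages listed.

```r
## GCV as in the formula above: ||Y - Y_pred||_n^2 / (1 - df/n)^2
gcv <- function(Y, Y_pred, df) {
  n <- length(Y)
  mean((Y - Y_pred)^2) / (1 - df / n)^2
}

## generic 5-fold cross-validation over a grid of candidate lambdas;
## fit_fun(X, Y, lambda) is a placeholder returning an object with a predict method
cv5 <- function(Y, X, lambdas, fit_fun, nfolds = 5) {
  n    <- length(Y)
  fold <- sample(rep(1:nfolds, length.out = n))
  err  <- sapply(lambdas, function(lam) {
    mean(sapply(1:nfolds, function(k) {
      fit <- fit_fun(X[fold != k, , drop = FALSE], Y[fold != k], lam)
      mean((Y[fold == k] - predict(fit, X[fold == k, , drop = FALSE]))^2)
    }))
  })
  lambdas[which.min(err)]
}
```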

About this article

Cite this article

Amato, U., Antoniadis, A. & De Feis, I. Additive model selection. Stat Methods Appl 25, 519–564 (2016). https://doi.org/10.1007/s10260-016-0357-8
