Modal additive models with data-driven structure identification

Tieliang Gong; Chen Xu; Hong Chen

doi:10.3934/mfc.2020016

2020, Volume 3, Issue 3: 165-183. Doi: 10.3934/mfc.2020016

1. Department of Mathematics and Statistics, University of Ottawa, Ottawa, ON, K1N 6N5, Canada
2. College of Science, Huazhong Agricultural University, Wuhan 430070, China

* Corresponding author: Hong Chen

Received: January 2020

Revised: May 2020

Published: August 2020

Abstract

Additive models are highly flexible and have received a great deal of attention in high-dimensional regression analysis. Much effort has gone into capturing interactions between predictor variables within additive models. However, typical approaches rely on a conditional mean assumption, which may fail to reveal the structure when the data are contaminated by heavy-tailed noise. In this paper, we propose a penalized modal regression method, Modal Additive Models (MAM), based on a conditional mode assumption for simultaneous function estimation and structure identification. MAM approximates the non-parametric function through feedforward neural networks and maximizes the modal risk subject to constraints on the function space and group structure. The proposed approach can be implemented with the half-quadratic (HQ) optimization technique, and its asymptotic estimation and selection consistency are established. In particular, MAM achieves a satisfactory learning rate and identifies the target group structure with high probability. The effectiveness of MAM is also supported by simulated examples.
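To illustrate why a conditional mode can be a more stable target than a conditional mean under skewed noise, the following minimal Python sketch may help. It is not taken from the paper: the Gamma(2, 1) noise, the Gaussian kernel, the bandwidth `sigma` and the helper name `modal_location` are all assumptions chosen for illustration. The sketch maximizes an empirical modal risk of the form $\frac{1}{n\sigma}\sum_{i=1}^{n} \phi\big(\frac{y_i - c}{\sigma}\big)$ over a constant $c$ and compares the maximizer with the sample mean.

```python
import numpy as np

def modal_location(y, sigma=0.3, grid_size=300):
    """Return the constant c maximizing (1/(n*sigma)) * sum_i phi((y_i - c)/sigma).

    phi is the standard Gaussian density (an assumption for this sketch),
    and the maximization is a plain grid search over the range of y.
    """
    grid = np.linspace(y.min(), y.max(), grid_size)
    u = (y[None, :] - grid[:, None]) / sigma          # shape (grid_size, n)
    risk = np.exp(-0.5 * u**2).sum(axis=1) / (len(y) * sigma * np.sqrt(2 * np.pi))
    return grid[np.argmax(risk)]

rng = np.random.default_rng(0)
noise = rng.gamma(shape=2.0, scale=1.0, size=10000)   # population mean 2, mode 1
print("sample mean:   ", noise.mean())
print("modal estimate:", modal_location(noise))
```

For this skewed noise the mean sits near 2 while the modal estimate stays close to 1, the point of highest density; the smoothing bandwidth shifts the estimate slightly away from the exact mode.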

Mathematics Subject Classification: Primary: 68T05; Secondary: 62J02.

Figure 1. Estimated transformation function for selected groups. Top-left: group $(1, 6)$, top-right: group $(8, 12)$, bottom-left: group $(3, 7)$, bottom-right: group $(10, 13)$

Algorithm 1. Half-quadratic Optimization for MAM

1: Require: Input data $({{\bf x}}_i, y_i)_{i=1}^n$, kernel-induced representing function $\phi$, activating function $\psi$, weight parameter ${{\bf w}}$ and bias term ${{\bf b}}$.

2: Ensure: ${{\bf a}}_{{\bf z}}$;

3: Define function $f$ such that $f({{\bf x}}^2) = \phi({{\bf x}})$;

4: Initialize $\sigma$, coefficient ${{\bf a}}$;

5: while not converged do

6: Update $e_i$ by $e_i = f^\prime \Big( \big(\frac{y_i - f({{\bf x}}_i)}{\sigma} \big)^2 \Big)$;

7: Update ${{\bf a}}$ by ${{\bf a}} = \arg \max_{{{\bf a}} \in \mathbb{R}^h} \frac{1}{n \sigma}\sum_{i=1}^{n} \Big( e_i \big(\frac{y_i - f({{\bf x}}_i)}{\sigma} \big)^2 - g(e_i) \Big) - \lambda \|{{\bf a}}\|_2^2$;

8: Update $\sigma$;

9: end while

10: Output: ${{\bf a}}_{{\bf z}} = {{\bf a}}$.
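The iterations of Algorithm 1 can be sketched in Python under some simplifying assumptions that are not stated in the source. The predictor is taken to be linear in a precomputed feature matrix `H` (for example, hidden-layer outputs $\psi({{\bf w}}^\top {{\bf x}} + {{\bf b}})$), the kernel is the unnormalized Gaussian $\phi(u) = e^{-u^2/2}$ so that the half-quadratic function is $t \mapsto e^{-t/2}$, $\sigma$ is held fixed because its update rule is not given here, and the coefficient is initialized by ridge least squares. The function name `hq_modal_fit` is hypothetical. With these choices every $e_i$ is negative, so the arg max in step 7 reduces to a weighted ridge regression with the closed form $(H^\top W H + 2\lambda n \sigma^3 I)\,{{\bf a}} = H^\top W {{\bf y}}$, where $W = \mathrm{diag}(-2 e_i)$.

```python
import numpy as np

def hq_modal_fit(H, y, sigma=0.5, lam=1e-4, max_iter=100, tol=1e-8):
    """Half-quadratic iterations for the predictor H @ a with a Gaussian kernel.

    sigma stays fixed in this sketch (assumption); H is an (n, h) feature matrix.
    """
    n, h = H.shape
    ridge = 2.0 * lam * n * sigma**3 * np.eye(h)
    # Step 4: initialize a with a plain ridge least-squares fit (assumption).
    a = np.linalg.solve(H.T @ H + ridge, H.T @ y)
    for _ in range(max_iter):
        r = (y - H @ a) / sigma
        # Step 6: e_i is the derivative of t -> exp(-t / 2) evaluated at r_i^2.
        e = -0.5 * np.exp(-0.5 * r**2)
        # Step 7: since e_i < 0, the arg max is a weighted ridge regression.
        w = -2.0 * e
        a_new = np.linalg.solve(H.T @ (w[:, None] * H) + ridge, H.T @ (w * y))
        if np.linalg.norm(a_new - a) < tol:
            return a_new
        a = a_new
    return a

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 0.1 * rng.standard_normal(200)
y[:40] += 5.0                                  # one-sided outliers shift the mean
H = np.column_stack([x, np.ones_like(x)])      # columns: slope, intercept
print("modal fit (slope, intercept):", hq_modal_fit(H, y))
```

In this toy example the one-sided outliers receive weights close to zero after the first iteration, so the fit follows the bulk of the data rather than the contaminated mean.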

Algorithm 2. Backward Stepwise Selection for MAM

1: Start with the variable pool $G = \{(1,2,\cdots, d)\}$ ;

2: Solve (13) to obtain the maximum value $\mathscr{R}_{\lambda, G}$ ;

3: for each variable $j$ in $G$ do

4: $\hat{G} \longleftarrow$ either divide $j$ into subgroups or add to an existing group;

5: Solve (13) to obtain the maximum value $\mathscr{R}_{\lambda, \hat{G}}$ ;

6: if $\mathscr{R}_{\lambda, \hat{G}} > \mathscr{R}_{\lambda, G}$ then

7: Preserve $\hat{G}$ as the new group structure;

8: end if

9: end for

10: Return $\hat{G}$ .
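One possible reading of Algorithm 2 is sketched below in Python. The interpretation of step 4 (try placing variable $j$ in a group of its own or in each other existing group, and keep a candidate only if it scores strictly higher) is an assumption, as are the helper names `canonical`, `candidate_moves`, `stepwise_group_search` and `pair_agreement`. The `score` callable stands in for solving (13) to obtain $\mathscr{R}_{\lambda, G}$; here it is replaced by a toy score that counts variable pairs grouped the same way as in a known target, purely so the sketch runs on its own. Variables are zero-based, and the function returns the best structure found so far.

```python
from itertools import combinations

def canonical(groups):
    """Sort variables within each group and the groups within the partition."""
    return sorted(tuple(sorted(g)) for g in groups if g)

def candidate_moves(groups, j):
    """Partitions obtained by moving variable j to its own or another group."""
    rest = [tuple(v for v in g if v != j) for g in groups]
    rest = [g for g in rest if g]
    moves = [canonical(rest + [(j,)])]
    for k in range(len(rest)):
        moves.append(canonical(rest[:k] + [rest[k] + (j,)] + rest[k + 1:]))
    return moves

def stepwise_group_search(d, score):
    """Greedy search starting from a single group that holds all d variables."""
    groups = canonical([tuple(range(d))])
    best = score(groups)
    for j in range(d):
        for cand in candidate_moves(groups, j):
            if cand == groups:
                continue
            value = score(cand)
            if value > best:          # keep the candidate only if it improves
                groups, best = cand, value
    return groups

def pair_agreement(target):
    """Toy score (assumption): pairs grouped the same way as in target."""
    d = sum(len(g) for g in target)
    def together(partition, i, j):
        return any(i in g and j in g for g in partition)
    def score(groups):
        return sum(together(groups, i, j) == together(target, i, j)
                   for i, j in combinations(range(d), 2))
    return score

target = [(0, 1), (2,), (3,), (4, 5)]   # structure {(1, 2), (3), (4), (5, 6)}, zero-based
print(stepwise_group_search(6, pair_agreement(target)))
```

In an actual run the toy score would be replaced by the maximized penalized modal risk for each candidate structure, which makes each evaluation considerably more expensive than in this sketch.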

Table 1. Selected models for simulation study and the corresponding intrinsic group structures

ID	Model	Intrinsic group structure
M1	$y = x_1 + x_2^2 + \frac{1}{1+ x_3^2} + \sin(\pi x_4) +\log(x_5+5) + \sqrt{\|x_6\|} + \epsilon$	$\{(1),(2),(3),(4),(5),(6)\}$
M2	$y = \frac{\sin(x_1)}{x_1 } + \cos((x_2 +x_3)\cdot \pi ) + \arctan((x_4 + x_5 + x_6)^2)+ \epsilon$	$\{(1),(2, 3),(4, 5, 6)\}$
M3	$y = \sin(x_1 + x_2) + 2\log(x_3 + 5) +x_4 + x_5\cdot x_6 + \epsilon$	$\{(1, 2), (3), (4), (5, 6)\}$
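The noise-free parts of the three models in Table 1 can be written down directly; the Python sketch below does so and draws a sample. The input distribution (uniform on $[-1, 1]$), the Gaussian noise with scale `noise_scale`, and the function names are assumptions for illustration only, since the sampling scheme is not given in this excerpt. For scalar inputs, $\sqrt{\|x_6\|}$ is taken as $\sqrt{|x_6|}$.

```python
import numpy as np

def m1(X):
    x1, x2, x3, x4, x5, x6 = X.T
    return (x1 + x2**2 + 1.0 / (1.0 + x3**2) + np.sin(np.pi * x4)
            + np.log(x5 + 5.0) + np.sqrt(np.abs(x6)))

def m2(X):
    x1, x2, x3, x4, x5, x6 = X.T
    # np.sinc(t) = sin(pi t) / (pi t), so sin(x1) / x1 = np.sinc(x1 / pi),
    # which stays finite at x1 = 0.
    return (np.sinc(x1 / np.pi) + np.cos((x2 + x3) * np.pi)
            + np.arctan((x4 + x5 + x6) ** 2))

def m3(X):
    x1, x2, x3, x4, x5, x6 = X.T
    return np.sin(x1 + x2) + 2.0 * np.log(x3 + 5.0) + x4 + x5 * x6

def simulate(model, n, noise_scale=0.1, seed=0):
    """Draw (X, y) with y = model(X) + eps under the assumed sampling scheme."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 6))       # assumed input distribution
    eps = noise_scale * rng.standard_normal(n)    # assumed Gaussian noise
    return X, model(X) + eps

X, y = simulate(m3, n=100)
print(X.shape, y.shape)
```

Swapping the `eps` line for a different generator (for instance a Gamma draw) would give a skewed-noise variant of the same models.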

Table 3. Average performance in identifying the intrinsic group structures for each $(\mu, \beta)$ pair (Gaussian noise)

Parameters		M1					M2					M3
$\mu$	$\beta$	MF	Size	TP	U	O	MF	Size	TP	U	O	MF	Size	TP	U	O
$1 \rm{e} - 6$	$1$	0	2	1	1	0	0	2	0.66	1	0	0	2	1	0	1
$1 \rm{e} - 5$	$1$	0	2	1	1	0	0	2	0.84	1	0	0	2	1	0	1
$1 \rm{e} - 4$	$1$	0	2	1	1	0	0	2	0.68	1	0	0	2	0.1	1	0
$1 \rm{e} - 3$	$1$	0	2	1	1	0	0	2	0.46	0.46	0	0	2	1	1	0
$1 \rm{e} - 2$	$1$	0	2	1	1	0	0	2	0.62	0.62	0	0	2	1	1	0
$1 \rm{e} - 1$	$1$	0	2	1	1	0	0	2	0.78	0.78	0	0	2	1	0	0
$1 \rm{e} - 6$	$3$	0	3	2	1	0	0	2	0.42	0.42	0	0	2	0.66	0.66	0
$1 \rm{e} - 5$	$3$	0	2.84	1.78	0.94	0	0	2	0.54	0.54	0	0	2	0	1	0
$1 \rm{e} - 4$	$3$	0	3.36	2.32	1	0	0	2	0.58	0.58	0	0	2.2	1.6	1	0
$1 \rm{e} - 3$	$3$	0	4.9	3.9	1	0	0	2	0.78	0.78	0	50	4	4	0	0
$1 \rm{e} - 2$	$3$	50	6	6	0	0	29	3.62	1.9	0	0.22	50	4	4	0	0
$1 \rm{e} - 1$	$3$	50	6	6	0	0	0	5.38	1.62	0	1	0	6	2	0	1
$1 \rm{e} - 6$	$5$	0	2.72	1.64	0.92	0	0	2	0.5	0.5	0	0	2.3	0.6	1	0
$1 \rm{e} - 5$	$5$	0	3.4	1.6	0.8	0	0	2	0.58	0.58	0	0	3	2	1	0
$1 \rm{e} - 4$	$5$	0	4.82	3.82	1	0	0	2.01	0.38	0.38	0	50	4	4	0	0
$1 \rm{e} - 3$	$5$	27	5.54	5.08	0.46	0	28	3.44	1.76	0	0	50	4	4	0	0
$1 \rm{e} - 2$	$5$	50	6	6	0	0	0	5	2	0	1	0	6	2	0	1
$1 \rm{e} - 1$	$5$	50	6	6	0	0	0	6	1	0	1	0	6	2	0	1

Table 4. Average performance in identifying the intrinsic group structures for each $(\mu, \beta)$ pair (Gamma noise)

Parameters		M1					M2					M3
$\mu$	$\beta$	MF	Size	TP	U	O	MF	Size	TP	U	O	MF	Size	TP	U	O
$1 \rm{e} - 6$	$1$	0	2	1	1	0	0	2	0.6	0.6	0	0	2	1	1	0
$1 \rm{e} - 5$	$1$	0	2	1	1	0	0	2	0.7	0.7	0	0	2	1	1	0
$1 \rm{e} - 4$	$1$	0	2	1	1	0	0	2	1	1	0	0	2	1	1	0
$1 \rm{e} - 3$	$1$	0	2	1	1	0	0	2	0.92	0.92	0	0	2	1	1	0
$1 \rm{e} - 2$	$1$	0	2	1	1	0	0	2	0.58	0.58	0	0	2	1	1	0
$1 \rm{e} - 1$	$1$	0	2	1	1	0	0	2	0.76	0.76	0	0	2	1	1	0
$1 \rm{e} - 6$	$3$	0	2	1	1	0	0	2	0.52	0.52	0	0	2	1	1	0
$1 \rm{e} - 5$	$3$	0	2	1	1	0	0	2	1	1	0	0	2.42	0.66	1	0
$1 \rm{e} - 4$	$3$	0	3.8	2.6	1	0	0	2	0.8	0.8	0	0	2	1	1	0
$1 \rm{e} - 3$	$3$	0	4	3	1	0	5	2.26	0.92	0.62	0	50	4	4	0	0
$1 \rm{e} - 2$	$3$	42	5.84	5.88	0.16	0	27	3.66	1.82	0	0.2	50	4	4	0	0
$1 \rm{e} - 1$	$3$	50	6	6	0	0	0	6	1	0	1	0	6	2	0	1
$1 \rm{e} - 6$	$5$	0	2.56	1.48	1	0	0	2	0.62	0.62	0	0	2	0.92	0.92	0
$1 \rm{e} - 5$	$5$	0	3.5	2.5	1	0	0	2	0.66	0.66	0	0	3	2	1	0
$1 \rm{e} - 4$	$5$	7	4.88	3.76	0.86	0	24	3.08	1.8	0	0.08	0	2.2	0.52	1	0
$1 \rm{e} - 3$	$5$	8	4.94	3.84	0.84	0	27	3.4	1.6	0	0	50	4	4	0	0
$1 \rm{e} - 2$	$5$	50	6	6	0	0	0	5	2	0	1	0	5.14	2.86	0	1
$1 \rm{e} - 1$	$5$	50	6	6	0	0	0	6	1	0	1	0	6	2	0	1

Table 2. Mean absolute error comparisons (mean $\pm$ std.) for Gaussian and Gamma noise

	GASI		MAM
Model	Gaussian	Gamma	Gaussian	Gamma
M1	$186.3.92. \pm 437.8$	$458.8 \pm 988.8$	$\mathbf{109.92 \pm 257.2}$	$\mathbf{272.8 \pm 536.2}$
M2	$1.088 \pm 0.025$	$0.774 \pm 0.032$	$\mathbf{0.839 \pm 0.023}$	$\mathbf{0.751 \pm 0.028}$
M3	$\mathbf{0.857 \pm 0.025}$	$\mathbf{ 0.873 \pm 0.019}$	$0.901 \pm 0.028$	$0.917 \pm 0.021$
