Randomized sketches for sparse additive models
Introduction
Statistical estimation of multivariate functions is generally difficult to interpret and may suffer from the curse of dimensionality. Semiparametric models that involve only the estimation of univariate functions, even in settings with multiple predictors, are therefore popular. This motivated additive models, in which each predictor enters the model through a flexible nonparametric component [1] while only univariate functions need to be estimated.
On the other hand, even when a great many predictors are available for inclusion in the initial model, many of them may be irrelevant, and including them decreases prediction accuracy. One popular remedy in the literature is to assume that only a few of the component functions are nonzero in the true model and to use sparsity-inducing penalties to identify the nonzero functions [2], [3], [4], [5], [6]. Different approaches to additive models have been proposed, including regression splines, smoothing splines, and kernel methods. Within the reproducing kernel Hilbert space (RKHS) framework in particular, theoretical results for sparse additive models have attracted increasing interest [2], [4]. The RKHS framework is very general, including linear regression, polynomial regression, and smoothing splines as special cases. Its mathematical elegance has also led to several developments on the optimal convergence rates of the model in high dimensions, with p possibly larger than the sample size n [7], [8], [9], although we focus on the fixed-dimensional setting in the current paper. Rather, we are interested in the case where n is so large that computational efficiency in fitting the model is the main concern. We also adopt the RKHS framework in this work.
Fitting sparse additive models with penalty terms typically requires an iterative algorithm, so it is not clear how to give an explicit computational complexity bound as in classical computation theory. Note, however, that even for kernel ridge regression with a single function, the standard implementation requires inverting an n × n kernel matrix and thus has complexity $O(n^3)$, where n is the sample size. Model fitting therefore becomes infeasible in the big-data setting, which has attracted much attention recently. One general approach to handling large data with theoretical guarantees is the divide-and-conquer method, which partitions the entire data set into multiple subsets, computes an estimator on each subset, and finally aggregates the multiple estimators into a final estimator. This approach has been investigated in [10], [11], [12], [13], [14], [15], [16], [17], [18], although its use in sparse additive models has not been studied.
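For concreteness, the following minimal sketch of standard kernel ridge regression (in Python, with an illustrative Gaussian kernel; the kernel choice and regularization scaling are our own assumptions) shows why the standard implementation scales as $O(n^3)$:

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    # Pairwise Gaussian kernel evaluations between rows of X and Z.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def krr_fit(X, y, lam):
    """Standard kernel ridge regression: solving the n x n system
    (K + n * lam * I) alpha = y costs O(n^3) time and O(n^2) memory."""
    n = X.shape[0]
    K = gaussian_kernel(X, X)
    return np.linalg.solve(K + n * lam * np.eye(n), y)
```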
In this paper, we consider approximations of sparse additive models based on random projection, also known as randomized sketches. This method projects the n-dimensional data vectors onto an m-dimensional space using a random sketching matrix [19], [20], [21], [22]. Most closely related to our work is [23] on randomized sketches for kernel ridge regression (KRR), and our results rely on some of the results obtained in that paper. The method of [23] applies the sketch to the kernel matrix, carefully constructed to retain the minimax optimality of the KRR estimate. Our contribution in this paper is to establish the statistical properties of randomized sketches for sparse additive models. Unlike KRR, here the penalized estimator does not have an explicit form, which makes our proof significantly different from that of [23], in spite of some partial similarities. Note that even with p = 1 (p is the number of predictors) the model does not reduce to that of KRR, since the latter uses a ridge penalty involving the square of the RKHS norm, whereas our penalty contains both the ℓ2-norm of the functions evaluated at the observations and the (non-squared) RKHS norm.
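To illustrate the sketching idea in the single-kernel case, the snippet below restricts the KRR coefficient vector to the range of a random m × n sketch in the spirit of [23]; the Gaussian sketch and the penalty scaling are illustrative choices and need not match the exact constants of the paper:

```python
import numpy as np

# Reuses gaussian_kernel from the previous snippet.
def sketched_krr_fit(X, y, lam, m, seed=0):
    """Sketched KRR: parametrize the coefficients as S^T theta with a
    random m x n sketch S, so only an m x m linear system is solved."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = gaussian_kernel(X, X)
    S = rng.standard_normal((m, n)) / np.sqrt(m)  # Gaussian sketch
    KS = K @ S.T                                  # n x m
    # Normal equations of (1/2n)||y - K S^T t||^2 + lam * t^T (S K S^T) t
    A = KS.T @ KS / n + 2.0 * lam * (S @ KS)
    b = KS.T @ y / n
    theta = np.linalg.solve(A, b)
    return KS @ theta                             # fitted values
```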
Our work is a natural extension of randomized sketches to the case of multiple predictors, which expands the applicability of the method. We regard sketching as an alternative to the more popular divide-and-conquer approach rather than a replacement for it; one can also envision future developments combining the two methods. Additive models in the RKHS framework have been studied extensively, with many elegant theoretical results in the statistics and machine learning literature. Many other models have been proposed under the RKHS framework, including for example quantile regression, or even non-regression models such as kernel principal component analysis and kernel canonical correlation analysis. Sketching for these models likely requires techniques different from those used in this work to establish their theoretical properties, which is outside the scope of the current paper.
The rest of the article is organized as follows. In Section 2, we present the sparse additive model and the sketching method based on a random projection matrix. In Section 3, we establish the optimal convergence rate of the penalized estimator. Section 4 reports some simulation studies demonstrating the statistical and computational properties of the sketched estimator. We conclude with some discussions in Section 5. Proofs of some technical lemmas are relegated to the Appendix.
Random sketches for sparse additive models
First we review the concept of reproducing kernel Hilbert spaces (RKHS). Suppose $(\mathcal{X}, \mathcal{A}, P)$ is a probability space with σ-algebra $\mathcal{A}$ and probability measure $P$. The $L_2(P)$ norm of a function is denoted by $\|\cdot\|$. A Hilbert space $\mathcal{H}$ of functions on $\mathcal{X}$ with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is an RKHS if there is a positive-definite kernel function $K: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ (i.e., $K(s,t) = K(t,s)$ and $\sum_{i,j} a_i a_j K(x_i, x_j) \geq 0$ for any sequences $\{x_i\} \subset \mathcal{X}$ and $\{a_i\} \subset \mathbb{R}$) such that (i) $K(\cdot, x) \in \mathcal{H}$ for all $x \in \mathcal{X}$, and (ii) we have the reproducing property $f(x) = \langle f, K(\cdot, x) \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$.
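As a quick numerical illustration of positive definiteness, the first-order Sobolev kernel $K(s,t) = \min(s,t)$ on $[0,1]$ (a standard example; the specific kernel is our choice here) yields a positive semidefinite Gram matrix on any sample:

```python
import numpy as np

# Gram matrix of K(s, t) = min(s, t) on a random sample in [0, 1];
# positive semidefiniteness holds for any valid reproducing kernel.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(size=200))
G = np.minimum(x[:, None], x[None, :])
print(np.linalg.eigvalsh(G).min() >= -1e-10)  # True up to rounding error
```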
Asymptotic theory
For simplicity of notation mainly, we assume $K_j \equiv K$ and $\mathcal{H}_j \equiv \mathcal{H}$. Recall that the complexity function is given by
$$\mathcal{R}(\delta) = \sqrt{\frac{1}{n} \sum_{i=1}^{\infty} \min\{\delta^2, \mu_i\}},$$
where $\mu_1 \geq \mu_2 \geq \cdots \geq 0$ are the eigenvalues of the kernel $K$, and we let $\delta_n$ be the smallest positive solution of $\mathcal{R}(\delta) \leq \delta^2$. It is known that the minimax rate for the additive model is $\delta_n^2$ [8].
Suppose that the kernel matrices have the eigendecompositions $K_j = U_j D_j U_j^\top$, where $U_j$ is an orthonormal matrix and $D_j$ is diagonal with diagonal elements $\hat{\mu}_{j,1} \geq \cdots \geq \hat{\mu}_{j,n} \geq 0$. The empirical complexity functions are
$$\widehat{\mathcal{R}}_j(\delta) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \min\{\delta^2, \hat{\mu}_{j,i}\}}.$$
Define
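As an illustration, the empirical complexity function and a grid search for the critical radius can be computed from the kernel eigenvalues as below; the stopping rule $\widehat{\mathcal{R}}(\delta) \leq \delta^2$ omits constants that appear in formal definitions and is an assumption made for simplicity:

```python
import numpy as np

def empirical_complexity(mu_hat, delta):
    """R_hat(delta) = sqrt((1/n) * sum_i min(delta^2, mu_hat_i)), where
    mu_hat holds the n eigenvalues of the normalized kernel matrix."""
    n = mu_hat.shape[0]
    return np.sqrt(np.minimum(delta ** 2, mu_hat).sum() / n)

def critical_radius(mu_hat, grid=None):
    # Smallest delta on a grid with R_hat(delta) <= delta^2; formal
    # definitions carry extra constants, omitted here for simplicity.
    if grid is None:
        grid = np.logspace(-4, 0, 400)
    for d in grid:
        if empirical_complexity(mu_hat, d) <= d ** 2:
            return d
    return grid[-1]
```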
Numerical studies
We perform some simulations to illustrate the performance of the sketched estimator, using a relatively small dimension p. For each sample size n, most of the observations are generated uniformly on $[1/2, 1]^p$, and the rest are generated by taking absolute values of draws from a centered distribution so that all predictors lie in $[0, 1]$. This heterogeneous way of generating the predictors follows an illustration in [23]. The responses are generated from
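One hypothetical rendering of such a heterogeneous design is sketched below; the 90/10 split, the dimension, and the half-Gaussian used for the second group are our own illustrative assumptions, since the exact specification is not shown in this snippet:

```python
import numpy as np

# Hypothetical version of the heterogeneous predictor design above.
rng = np.random.default_rng(0)
n, p = 1000, 4
bulk = rng.uniform(size=n) < 0.9        # assumed 90/10 split
X = np.empty((n, p))
X[bulk] = rng.uniform(0.5, 1.0, size=(bulk.sum(), p))
X[~bulk] = np.minimum(
    np.abs(rng.normal(0.0, 0.25, size=((~bulk).sum(), p))), 1.0
)  # absolute values keep all predictors in [0, 1]
```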
Conclusions
In this paper, we constructed a sketched estimator for sparse additive models in the reproducing kernel Hilbert space framework. We established that the convergence rate of the sketched estimator is the same as that of the standard unsketched estimator. Our numerical examples demonstrate the good performance of the sketched estimator, which can also be much faster to compute.
In our numerical examples, the performances of sketching and divide-and-conquer are similar. However, it is well known that even for a
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
CRediT authorship contribution statement
Fode Zhang: Writing - original draft, Methodology, Software. Xuejun Wang: Formal analysis, Software. Rui Li: Formal analysis, Software. Heng Lian: Conceptualization, Writing - review & editing, Formal analysis.
Acknowledgments
The authors sincerely thank the editor, the associate editor and several anonymous reviewers for their insightful comments that improved the manuscript. The research of Fode Zhang is partially supported by the Fundamental Research Funds for the Central Universities (Nos. JBK1901053, JBK1806002 and JBK140507) of China. The research of Xuejun Wang is partially supported by NSFC (11671012). Rui Li's research was supported by the National Social Science Fund of China (No. 17BTJ025). The research of
References (27)
- et al., Some results on Tchebycheffian spline functions, J. Math. Anal. Appl. (1971).
- et al., On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario, Sens. Actuat. B: Chem. (2008).
- et al., Generalized additive models, Monographs on Statistics and Applied Probability (1990).
- et al., Component selection and smoothing in multivariate nonparametric regression, Ann. Stat. (2006).
- et al., SpAM: sparse additive models.
- et al., High-dimensional additive modeling, Ann. Stat. (2009).
- Consistent variable selection in additive models, Stat. Sin. (2009).
- et al., Variable selection in nonparametric additive models, Ann. Stat. (2010).
- et al., Sparsity in multiple kernel learning, Ann. Stat. (2010).
- et al., Minimax-optimal rates for sparse additive models over kernel classes via convex programming, J. Mach. Learn. Res. (2012).
- Fast learning rate of multiple kernel learning: trade-off between sparsity and smoothness, Ann. Stat.
- Efficient large-scale distributed training of conditional maximum entropy models, Proceedings of Advances in Neural Information Processing Systems.
- Parallelized stochastic gradient descent, Proceedings of Advances in Neural Information Processing Systems.
Fode Zhang was born in Gansu, China, in 1983. He received the B.S. degree in mathematics from Northwest Normal University, Lanzhou, Gansu, China, in 2009, and the Ph.D. degree in statistics from Northwestern Polytechnical University, Xi'an, Shaanxi, China, in 2017. He is currently an Associate Professor in the Center of Statistical Research and School of Statistics, Southwestern University of Finance and Economics in Chengdu, China. His research interests include statistical inference, statistical manifold, and information geometry.
Xuejun Wang was born in Anhui, China, in 1981. He received the B.S. and Ph.D. degrees in statistics from Anhui University, Anhui, China, in 2007 and 2010, respectively. He is a professor and doctoral supervisor in the School of Mathematical Sciences of Anhui University, Anhui, China. His current research interests include probability limit theory, nonparametric regression estimation, semiparametric regression estimation, and related topics.
Rui Li was born in Shanxi, China, in 1980. He received the B.S. degree in Mathematics in 2003 from Taiyuan Normal College, the M.S. degree in Probability and Statistics in 2007 from Jiangsu University and the Ph.D. degree in Statistics in 2015 from Shanghai University of Finance and Economics, China. He is an associate professor in Shanghai University of International Business and Economics. His current research interests are nonparametric and semi-parametric models.
Heng Lian is currently an Associate Professor in the Department of Mathematics, City University of Hong Kong. He previously worked as an Assistant Professor at Nanyang Technological University, Singapore, and later as a Senior Lecturer at the University of New South Wales, Australia. His research interests include mathematical statistics, machine learning, and pattern recognition.