Randomized sketches for sparse additive models
Introduction
Statistical estimation of multivariate functions is generally difficult to interpret and may suffer from the curse of dimensionality. Semiparametric models that involve only the estimation of univariate functions, even in settings with multiple predictors, are therefore popular. This motivated additive models, in which each predictor enters the model through a flexible nonparametric component [1] while only univariate functions need to be estimated.
On the other hand, even when a great many predictors are available for inclusion in the initial model, many of them may be irrelevant, and including them decreases prediction accuracy. One popular remedy in the literature is to assume that only a few of the component functions are nonzero in the true model and to use sparsity-inducing penalties to identify the nonzero functions [2], [3], [4], [5], [6]. Different approaches to additive models have been proposed, including regression splines, smoothing splines, and kernel methods. Within the reproducing kernel Hilbert space (RKHS) framework in particular, theoretical results for sparse additive models have attracted increasing interest [2], [4]. The RKHS framework is very general, including linear regression, polynomial regression, and smoothing splines as special cases. Its mathematical elegance has also led to several developments on the optimal convergence rates of the model in high dimensions, with p possibly larger than the sample size n [7], [8], [9], although we focus on the fixed-dimensional setting in the current paper. Rather, we are interested in the case where n is so large that computational efficiency in fitting the model is the main concern. We also adopt the RKHS framework in this work.
Fitting sparse additive models with penalty terms typically requires an iterative algorithm, so it is not clear how to give an explicit computational complexity bound as in classical computation theory. Note, however, that even for kernel ridge regression with a single function, the standard implementation requires inverting an n × n kernel matrix and thus has complexity $O(n^3)$, where n is the sample size. Model fitting therefore becomes infeasible in the big-data setting, which has attracted much attention recently. One general approach to handling large data with theoretical guarantees is the divide-and-conquer method, which partitions the entire data set into multiple subsets, computes an estimator on each subset, and finally aggregates the multiple estimators into a final estimator. This approach has been investigated in [10], [11], [12], [13], [14], [15], [16], [17], [18], although its use in sparse additive models has not been studied.
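For concreteness, the following minimal sketch of standard kernel ridge regression (in Python, with an illustrative Gaussian kernel; the kernel choice and regularization scaling are our own assumptions) shows why the standard implementation scales as $O(n^3)$:

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    # Pairwise Gaussian kernel evaluations between rows of X and Z.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def krr_fit(X, y, lam):
    """Standard kernel ridge regression: solving the n x n system
    (K + n * lam * I) alpha = y costs O(n^3) time and O(n^2) memory."""
    n = X.shape[0]
    K = gaussian_kernel(X, X)
    return np.linalg.solve(K + n * lam * np.eye(n), y)
```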
In this paper, we consider approximations of sparse additive models based on random projection, also known as randomized sketches. This method projects the n-dimensional data vectors onto an m-dimensional space using a random sketching matrix [19], [20], [21], [22]. Most closely related to our work is [23] on randomized sketches for kernel ridge regression (KRR), and our results rely on some of the results obtained in that paper. The method of [23] applies the sketch to the kernel matrix, carefully constructed to retain the minimax optimality of the KRR estimate. Our contribution in this paper is to establish the statistical properties of randomized sketches for sparse additive models. Unlike KRR, here the penalized estimator does not have an explicit form, which makes our proof significantly different from that of [23], in spite of some partial similarities. Note that even with p = 1 (p is the number of predictors) the model does not reduce to that of KRR, since the latter uses a ridge penalty involving the square of the RKHS norm, whereas our penalty contains both the ℓ2-norm of the functions evaluated at the observations and the (non-squared) RKHS norm.
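To illustrate the sketching idea in the single-kernel case, the snippet below restricts the KRR coefficient vector to the range of a random m × n sketch in the spirit of [23]; the Gaussian sketch and the penalty scaling are illustrative choices and need not match the exact constants of the paper:

```python
import numpy as np

# Reuses gaussian_kernel from the previous snippet.
def sketched_krr_fit(X, y, lam, m, seed=0):
    """Sketched KRR: parametrize the coefficients as S^T theta with a
    random m x n sketch S, so only an m x m linear system is solved."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = gaussian_kernel(X, X)
    S = rng.standard_normal((m, n)) / np.sqrt(m)  # Gaussian sketch
    KS = K @ S.T                                  # n x m
    # Normal equations of (1/2n)||y - K S^T t||^2 + lam * t^T (S K S^T) t
    A = KS.T @ KS / n + 2.0 * lam * (S @ KS)
    b = KS.T @ y / n
    theta = np.linalg.solve(A, b)
    return KS @ theta                             # fitted values
```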
Our work is a natural extension of randomized sketches to the case of multiple predictors, which expands the applicability of the method. We regard sketching as an alternative to the more popular divide-and-conquer approach rather than a replacement for it; one can also envision future developments combining the two methods. Additive models in the RKHS framework have been studied extensively, with many elegant theoretical results in the statistics and machine learning literature. Many other models have been proposed under the RKHS framework, including for example quantile regression, or even non-regression models such as kernel principal component analysis and kernel canonical correlation analysis. Sketching for these models likely requires techniques different from those used in this work to establish their theoretical properties, which is outside the scope of the current paper.
The rest of the article is organized as follows. In Section 2, we present the sparse additive model and the sketching method based on a random projection matrix. In Section 3, we establish the optimal convergence rate of the penalized estimator. Section 4 reports some simulation studies demonstrating the statistical and computational properties of the sketched estimator. We conclude with some discussions in Section 5. Proofs of some technical lemmas are relegated to the Appendix.
Random sketches for sparse additive models
First we review the concept of reproducing kernel Hilbert spaces (RKHS). Suppose $(\mathcal{X}, \mathcal{A}, P)$ is a probability space with σ-algebra $\mathcal{A}$ and probability measure $P$. The $L_2(P)$ norm of a function is denoted by $\|\cdot\|$. A Hilbert space $\mathcal{H}$ of functions on $\mathcal{X}$ with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is an RKHS if there is a positive-definite kernel function $K: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ (i.e., $K(s,t) = K(t,s)$ and $\sum_{i,j} a_i a_j K(x_i, x_j) \geq 0$ for any sequences $\{x_i\} \subset \mathcal{X}$ and $\{a_i\} \subset \mathbb{R}$) such that (i) $K(\cdot, x) \in \mathcal{H}$ for all $x \in \mathcal{X}$, and (ii) we have the reproducing property $f(x) = \langle f, K(\cdot, x) \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$.
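As a quick numerical illustration of positive definiteness, the first-order Sobolev kernel $K(s,t) = \min(s,t)$ on $[0,1]$ (a standard example; the specific kernel is our choice here) yields a positive semidefinite Gram matrix on any sample:

```python
import numpy as np

# Gram matrix of K(s, t) = min(s, t) on a random sample in [0, 1];
# positive semidefiniteness holds for any valid reproducing kernel.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(size=200))
G = np.minimum(x[:, None], x[None, :])
print(np.linalg.eigvalsh(G).min() >= -1e-10)  # True up to rounding error
```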
Asymptotic theory
For simplicity of notation mainly, we assume $K_j \equiv K$ and $\mathcal{H}_j \equiv \mathcal{H}$. Recall that the complexity function is given by
$$\mathcal{R}(\delta) = \sqrt{\frac{1}{n} \sum_{i=1}^{\infty} \min\{\delta^2, \mu_i\}},$$
where $\mu_1 \geq \mu_2 \geq \cdots \geq 0$ are the eigenvalues of the kernel $K$, and we let $\delta_n$ be the smallest positive solution of $\mathcal{R}(\delta) \leq \delta^2$. It is known that the minimax rate for the additive model is $\delta_n^2$ [8].
Suppose that the kernel matrices have the eigendecompositions $K_j = U_j D_j U_j^\top$, where $U_j$ is an orthonormal matrix and $D_j$ is diagonal with diagonal elements $\hat{\mu}_{j,1} \geq \cdots \geq \hat{\mu}_{j,n} \geq 0$. The empirical complexity functions are
$$\widehat{\mathcal{R}}_j(\delta) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \min\{\delta^2, \hat{\mu}_{j,i}\}}.$$
Define
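As an illustration, the empirical complexity function and a grid search for the critical radius can be computed from the kernel eigenvalues as below; the stopping rule $\widehat{\mathcal{R}}(\delta) \leq \delta^2$ omits constants that appear in formal definitions and is an assumption made for simplicity:

```python
import numpy as np

def empirical_complexity(mu_hat, delta):
    """R_hat(delta) = sqrt((1/n) * sum_i min(delta^2, mu_hat_i)), where
    mu_hat holds the n eigenvalues of the normalized kernel matrix."""
    n = mu_hat.shape[0]
    return np.sqrt(np.minimum(delta ** 2, mu_hat).sum() / n)

def critical_radius(mu_hat, grid=None):
    # Smallest delta on a grid with R_hat(delta) <= delta^2; formal
    # definitions carry extra constants, omitted here for simplicity.
    if grid is None:
        grid = np.logspace(-4, 0, 400)
    for d in grid:
        if empirical_complexity(mu_hat, d) <= d ** 2:
            return d
    return grid[-1]
```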
Numerical studies
We perform some simulations to illustrate the performance of the sketched estimator, using a relatively small dimension p. For each sample size n, most of the observations are generated uniformly on $[1/2, 1]^p$, and the rest are generated by taking absolute values of draws from a centered distribution so that all predictors lie in $[0, 1]$. This heterogeneous way of generating the predictors follows an illustration in [23]. The responses are generated from
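One hypothetical rendering of such a heterogeneous design is sketched below; the 90/10 split, the dimension, and the half-Gaussian used for the second group are our own illustrative assumptions, since the exact specification is not shown in this snippet:

```python
import numpy as np

# Hypothetical version of the heterogeneous predictor design above.
rng = np.random.default_rng(0)
n, p = 1000, 4
bulk = rng.uniform(size=n) < 0.9        # assumed 90/10 split
X = np.empty((n, p))
X[bulk] = rng.uniform(0.5, 1.0, size=(bulk.sum(), p))
X[~bulk] = np.minimum(
    np.abs(rng.normal(0.0, 0.25, size=((~bulk).sum(), p))), 1.0
)  # absolute values keep all predictors in [0, 1]
```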
Conclusions
In this paper, we constructed a sketched estimator for sparse additive models in the reproducing kernel Hilbert space framework. We established that the convergence rate of the sketched estimator is the same as that of the standard unsketched estimator. Our numerical examples demonstrate the good performance of the sketched estimator, which can also be much faster to compute.
In our numerical examples, the performances of sketching and divide-and-conquer are similar. However, it is well known that even for a
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
CRediT authorship contribution statement
Fode Zhang: Writing - original draft, Methodology, Software. Xuejun Wang: Formal analysis, Software. Rui Li: Formal analysis, Software. Heng Lian: Conceptualization, Writing - review & editing, Formal analysis.
Acknowledgments
The authors sincerely thank the editor, the associate editor and several anonymous reviewers for their insightful comments that improved the manuscript. The research of Fode Zhang is partially supported by the Fundamental Research Funds for the Central Universities (Nos. JBK1901053, JBK1806002 and JBK140507) of China. The research of Xuejun Wang is partially supported by NSFC (11671012). Rui Li's research was supported by the National Social Science Fund of China (No. 17BTJ025). The research of
References (27)
- et al., Some results on Tchebycheffian spline functions, J. Math. Anal. Appl. (1971).
- et al., On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario, Sens. Actuat. B: Chem. (2008).
- et al., Generalized additive models, Monographs on Statistics and Applied Probability (1990).
- et al., Component selection and smoothing in multivariate nonparametric regression, Ann. Stat. (2006).
- et al., SpAM: sparse additive models.
- et al., High-dimensional additive modeling, Ann. Stat. (2009).
- Consistent variable selection in additive models, Stat. Sin. (2009).
- et al., Variable selection in nonparametric additive models, Ann. Stat. (2010).
- et al., Sparsity in multiple kernel learning, Ann. Stat. (2010).
- et al., Minimax-optimal rates for sparse additive models over kernel classes via convex programming, J. Mach. Learn. Res. (2012).
- Fast learning rate of multiple kernel learning: trade-off between sparsity and smoothness, Ann. Stat.
- Efficient large-scale distributed training of conditional maximum entropy models, Proceedings of Advances in Neural Information Processing Systems.
- Parallelized stochastic gradient descent, Proceedings of Advances in Neural Information Processing Systems.
Fode Zhang was born in Gansu, China, in 1983. He received the B.S. degree in mathematics from Northwest Normal University, Lanzhou, Gansu, China, in 2009, and the Ph.D. degree in statistics from Northwestern Polytechnical University, Xi'an, Shaanxi, China, in 2017. He is currently an Associate Professor in the Center of Statistical Research and School of Statistics, Southwestern University of Finance and Economics in Chengdu, China. His research interests include statistical inference, statistical manifold, and information geometry.
Xuejun Wang was born in Anhui, China, in 1981. He received the B.S. and Ph.D. degrees in statistics from Anhui University, Anhui, China, in 2007 and 2010, respectively. He is a professor and doctoral supervisor in the School of Mathematical Sciences of Anhui University, Anhui, China. His current research interests include probability limit theory, nonparametric regression estimation, semiparametric regression estimation, and related topics.
Rui Li was born in Shanxi, China, in 1980. He received the B.S. degree in Mathematics in 2003 from Taiyuan Normal College, the M.S. degree in Probability and Statistics in 2007 from Jiangsu University and the Ph.D. degree in Statistics in 2015 from Shanghai University of Finance and Economics, China. He is an associate professor in Shanghai University of International Business and Economics. His current research interests are nonparametric and semi-parametric models.
Heng Lian is currently an Associate Professor in the Department of Mathematics, City University of Hong Kong. He previously worked as an Assistant Professor at Nanyang Technological University, Singapore, and later as a Senior Lecturer at the University of New South Wales, Australia. His research interests include mathematical statistics, machine learning, and pattern recognition.