Fast learning rates for the sparse quantile regression problem
Introduction
Quantile regression has emerged as a comprehensive approach for analyzing the impact of regressors on the conditional distribution of a response variable; see [12], [13] and others. In this paper, we consider quantile regression from a learning theory viewpoint. We begin with a supervised learning problem over a collection of observational data $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m$, drawn from an unknown distribution $\rho$ on $X \times Y$, where $X$ is an input space and $Y \subseteq \mathbb{R}$ contains the corresponding outputs for the regression problem. Given a loss function $\psi$, we hope that the risk functional $\mathcal{E}(f) = \int_{X \times Y} \psi(y - f(x))\, d\rho$ is small over all measurable functions on $X$. The least squares loss is the most commonly used, and its minimizer corresponds to the conditional mean function. Conditional mean regression describes the centrality of the conditional response distribution. However, sometimes one wants a good estimate such that a prescribed proportion of the responses falls below it. For example, in financial risk management an investor may need to estimate a lower bound on the changes in the value of a portfolio that holds with high probability, so as to take measures beforehand to guard against large asset volatility. In this case the mean function is no longer adequate, since it cannot provide a more complete description of the conditional response distribution. This problem can be solved by means of quantile regression, which has an equivalent relationship with the so-called pinball loss, defined for a quantile level $\tau \in (0, 1)$ by
$$\psi_\tau(u) = \begin{cases} \tau u, & u > 0, \\ (\tau - 1)\, u, & u \le 0. \end{cases} \tag{1.1}$$
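To make the definition concrete, here is a minimal Python sketch (ours, not from the paper) that implements the pinball loss and numerically confirms that the constant minimizing the empirical pinball risk is the sample $\tau$-quantile:

```python
import numpy as np

def pinball_loss(u, tau):
    """Pinball loss psi_tau(u): tau * u if u > 0, (tau - 1) * u otherwise."""
    return np.where(u > 0, tau * u, (tau - 1.0) * u)

rng = np.random.default_rng(0)
y = rng.standard_normal(10_000)   # a sample from the response distribution
tau = 0.75                        # quantile level of interest

# Empirical pinball risk of a constant predictor c, minimized over a grid.
grid = np.linspace(-3.0, 3.0, 2001)
risks = [pinball_loss(y - c, tau).mean() for c in grid]
c_star = grid[int(np.argmin(risks))]

# The minimizer agrees (up to grid resolution) with the sample 0.75-quantile.
print(c_star, np.quantile(y, tau))
```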
Let $\rho(\cdot \mid x)$ be the conditional distribution of $y$ given $x$, and, for any fixed constant $\tau \in (0, 1)$, define the set-valued function
$$F_\tau(x) = \big\{ t \in \mathbb{R} : \rho((-\infty, t] \mid x) \ge \tau \ \text{and}\ \rho([t, \infty) \mid x) \ge 1 - \tau \big\}.$$
If $\rho(\cdot \mid x)$ has finite support, it is well understood that $F_\tau(x)$ is a bounded and closed interval at any $x \in X$ [20]. We write $f_\tau^-(x) := \min F_\tau(x)$ and $f_\tau^+(x) := \max F_\tau(x)$, which implies that $F_\tau(x) = [f_\tau^-(x), f_\tau^+(x)]$. Moreover, it is easy to check that the interior of $F_\tau(x)$ is a $\rho(\cdot \mid x)$-zero set, namely, $\rho\big((f_\tau^-(x), f_\tau^+(x)) \mid x\big) = 0$. In this paper we assume that $F_\tau(x)$ consists of singletons, namely, there exists a function $f_{\tau,\rho}$, called the conditional $\tau$-quantile function, such that $F_\tau(x) = \{f_{\tau,\rho}(x)\}$ for almost every $x \in X$.
With the help of the pinball loss, one can easily verify that $f_{\tau,\rho}$ is a minimizer of the risk $\mathcal{E}(f)$ associated with $\psi_\tau$, in the sense that $f_{\tau,\rho}(x)$ minimizes $\int_Y \psi_\tau(y - t)\, d\rho(y \mid x)$ over $t \in \mathbb{R}$ for almost every $x \in X$. Quantile regression is an important statistical method for analyzing the impact of regressors on the conditional distribution of a response variable. Compared with conditional mean regression, one of its advantages is that it is more robust to large outliers. In recent years, quantile regression has been widely used in many practical applications, such as reference charts in medicine [2], [9] and economics [11]. For comprehensive reviews of quantile regression, refer to the articles by Koenker and Hallock [11], [30] and the well-written book by Koenker [12].
Throughout this paper, we assume that $\rho$ is supported on $X \times [-M, M]$ for some constant $M > 0$, and it follows that $|y| \le M$ almost surely. The special case $\tau = 1/2$ of (1.1) gives (up to the constant factor $1/2$) the absolute value function, that is, $\psi_{1/2}(u) = |u|/2$, and the minimizer becomes the median function, defined at each $x \in X$ as the median of $\rho(\cdot \mid x)$.
Usually $f_{\tau,\rho}$ cannot be computed directly, due to the unknown distribution $\rho$. Instead, one seeks to minimize the empirical error associated with the sample $\mathbf{z}$ and the pinball loss, which is given by $\mathcal{E}_{\mathbf z}(f) = \frac{1}{m}\sum_{i=1}^m \psi_\tau(y_i - f(x_i))$. Unfortunately, minimizing $\mathcal{E}_{\mathbf z}$ alone may lead to overfitting; that is, complex functions fit the training data well but fail to generalize to unseen data. A promising approach to avoid this is to minimize the following regularized risk:
$$f_{\mathbf z} = \arg\min_{f \in \mathcal{H}} \big\{ \mathcal{E}_{\mathbf z}(f) + \lambda\, \Omega(f) \big\},$$
where $\mathcal{H}$ is a pre-specified hypothesis function space and $\Omega$ is a penalty functional on $\mathcal{H}$.
Kernel methods have been widely used in many areas of machine learning and have achieved great success. Among them, kernel regression has drawn much attention, including quantile regression and least squares regression. The study has focused on the application of Mercer kernels and regularization in the associated reproducing kernel Hilbert space (RKHS). Let $K : X \times X \to \mathbb{R}$ be a bounded, symmetric, and positive semi-definite function. The RKHS $\mathcal{H}_K$ associated with the kernel $K$ is the completion of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$, with the inner product given by $\langle K_x, K_{x'} \rangle_K = K(x, x')$. With these preparations, the learning algorithm for quantile regression is given by the regularization scheme (see [23], [16], [20])
$$f_{\mathbf z} = \arg\min_{f \in \mathcal{H}_K} \Big\{ \frac{1}{m}\sum_{i=1}^m \psi_\tau(y_i - f(x_i)) + \lambda \|f\|_K^2 \Big\}, \tag{1.3}$$
where $\lambda > 0$ is a regularization parameter controlling the trade-off between the empirical error and the penalty. In practice, we need to choose an adaptive parameter $\lambda$, which usually depends on the sample; cross validation is probably the most commonly used technique for this purpose, but it imposes a heavy computational burden when the sample size is large. To this end, in Section 5 we will employ a standard training-validation approach to choose a suitable $\lambda$, and we also give the convergence rate of our proposed algorithm (1.4) below based on such an adaptive parameter.
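For intuition about how a scheme of the form (1.3) can be solved, the following is a minimal sketch under our own assumptions (Gaussian kernel, fixed step size, plain subgradient descent on the coefficient vector of the kernel expansion); it illustrates the scheme and is not the paper's implementation:

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kqr_rkhs(X, y, tau=0.5, lam=1e-3, lr=0.1, n_iter=2000):
    """Subgradient descent for (1.3): min_a mean(psi_tau(y - K a)) + lam * a' K a."""
    m = len(y)
    K = gaussian_gram(X)
    alpha = np.zeros(m)
    for _ in range(n_iter):
        u = y - K @ alpha
        g = np.where(u > 0, tau, tau - 1.0)   # subgradient of psi_tau at each residual
        grad = K @ (2 * lam * alpha - g / m)  # gradient of the regularized objective
        alpha -= lr * grad
    return alpha, K

# Toy usage: estimate the conditional 0.9-quantile of a noisy sine curve.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.3 * rng.standard_normal(200)
alpha, K = kqr_rkhs(X, y, tau=0.9)
f_hat = K @ alpha   # fitted values at the training points
```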
Note that the resulting function generated by (1.3) is given as a dense expansion in terms of the training patterns, which adds greatly to the computational cost. To solve large-data problems, a promising fact is that sparse solutions are often achieved in statistics and compressed sensing settings by imposing an L1-regularizer on the expansion coefficients [3], [18], [24]. The L1-norm penalty not only shrinks the fitted coefficients toward zero, but also forces some of them to be exactly zero when $\lambda$ is chosen large enough. Thus many irrelevant noise variables or useless data points can be removed. In order to introduce the L1-regularizer in a machine learning setup, we need to study the coefficient-based regularization scheme, which can be written as
$$f_{\mathbf z} = f_{\alpha^{\mathbf z}}, \qquad \alpha^{\mathbf z} = \arg\min_{\alpha \in \mathbb{R}^m} \Big\{ \frac{1}{m}\sum_{i=1}^m \psi_\tau\big(y_i - f_\alpha(x_i)\big) + \lambda \|\alpha\|_1 \Big\}, \tag{1.4}$$
where $f_\alpha(x) = \sum_{j=1}^m \alpha_j K(x, x_j)$ and $\|\alpha\|_1 = \sum_{j=1}^m |\alpha_j|$.
Algorithm (1.4) is a linear program, which can be solved efficiently by existing codes even for large-scale problems. For completeness, we also give the concrete optimization procedure in Section 6. This approach is similar to the linear programming regularization proposed by Smola et al. [23]. The main difference is that there a general convex function class substitutes for the set of functions $\{K_{x_i} : i = 1, \dots, m\}$ used here, which acts directly on the data points $x_i$. It is worth noting that our choice of basis functions is natural, since the solution of (1.3) can be found in the finite-dimensional space spanned by the set of functions $\{K_{x_i} : i = 1, \dots, m\}$ [16], [23]. We also find later that this choice ensures good generalization ability theoretically.
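As an illustration of this linear programming view, here is a sketch (our own, not the paper's Section 6 procedure) that casts (1.4) as an LP via the standard splitting $\alpha = \alpha^+ - \alpha^-$ with residual slack variables, and solves it with `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def kqr_l1_lp(K, y, tau=0.5, lam=1e-3):
    """Solve (1.4) as a linear program.

    Variables: [a_plus, a_minus, xi_plus, xi_minus], all >= 0, with
    alpha = a_plus - a_minus and residual y - K @ alpha = xi_plus - xi_minus.
    Objective: lam * sum(a_plus + a_minus)                        (L1 penalty)
             + (1/m) * sum(tau * xi_plus + (1 - tau) * xi_minus)  (pinball loss).
    """
    m = len(y)
    c = np.concatenate([
        lam * np.ones(m), lam * np.ones(m),
        tau / m * np.ones(m), (1 - tau) / m * np.ones(m),
    ])
    I = np.eye(m)
    A_eq = np.hstack([K, -K, I, -I])   # K(a+ - a-) + xi+ - xi- = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    a_plus, a_minus = res.x[:m], res.x[m:2 * m]
    return a_plus - a_minus            # the coefficient vector alpha

# Usage, given a Gram matrix K and responses y:
# alpha = kqr_l1_lp(K, y, tau=0.5, lam=0.01)
# print(np.mean(np.abs(alpha) < 1e-8))   # fraction of exactly-zero coefficients
```

At the optimum, $\xi_i^+$ and $\xi_i^-$ pick up the positive and negative parts of the residual, so the linear term $\tau\xi_i^+ + (1-\tau)\xi_i^-$ equals the pinball loss of the $i$-th residual.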
It is worth noting that some previous works have considered the same "L1 loss + L1 penalty" structure. For example, Koenker et al. [14] used the total variation of the derivative, $V(f') = \int |f''(x)|\, dx$, as the penalty, namely,
$$\min_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \psi_\tau(y_i - f(x_i)) + \lambda\, V(f'),$$
where $\mathcal{F}$ is a certain function space. It is shown there that the solution is a linear spline with knots at the data points if $\lambda$ is chosen appropriately. This partially motivates us to study the coefficient-based regularization scheme, considering that smoothing spline functions can be viewed as arising from a special kernel.
In the field of machine learning, coefficient-based regularization was first introduced in [6] to design the linear programming support vector machine. In [7], the sparsity of the estimated coefficients in least squares regression was discussed via a spectral decomposition technique. In terms of theoretical analysis, [27] derived learning rates by using a local polynomial reproduction formula from approximation theory; in that analysis, special attention was paid to quantile regression under rather general conditions. Ref. [17] provided a unified analytical framework for coefficient-based regularization with a strictly convex regularizer and an indefinite kernel. As far as we know, the previous works mentioned above did not consider the effect of the conditional distribution on the convergence rates. In this paper, we improve the corresponding learning rates sharply by making full use of the information on the conditional distribution and the so-called comparison theorem first proposed in [32]. In addition, we apply empirical covering numbers to measure the functional complexity, instead of the uniform covering numbers used in [27]. This ensures the finiteness of the entropy integral with respect to the hypothesis space, as shown in Section 4.
From the algorithmic point of view, one expects that (1.4) would be more stable and computationally efficient, and, most importantly, that it can approximate the target function well over the whole space $X$ as the sample size $m$ tends to infinity. Error analysis for the regularization scheme (1.4) aims at estimates of the excess generalization error $\mathcal{E}(f_{\mathbf z}) - \mathcal{E}(f_{\tau,\rho})$ in terms of the kernel $K$, the pinball loss, and the underlying measure $\rho$, through a proper choice of the regularization parameter $\lambda$. However, this only implies that $f_{\mathbf z}$ is close to $f_{\tau,\rho}$ in a weak sense. Recently, some stronger convergence rates have been derived by establishing so-called self-calibration inequalities (see [20], [25]).
To sum up, we list our main contributions as follows:
- Improved learning rates are obtained by employing so-called variance bounds;
- We establish stronger convergence rates by making use of the self-calibration inequalities. Meanwhile, our learning rates can also be derived by a simple data-dependent parameter selection method, which does not require prior knowledge of the target function;
- We implement a simulation experiment and a real data application to demonstrate the usefulness of the new method. In particular, the sparsity of the resulting solutions is highlighted in Section 6.
Section snippets
Definitions and main results
In this section we first introduce some basic notation and assumptions related to the data space and the underlying distribution, which are required throughout this text. Then we present two useful lemmas, which play an important role in our analysis of the learning rates of (1.4). Finally, we present and discuss our learning rates. Our analysis employs some mathematical techniques from multivariate approximation [10], and depends on two key assumptions on the structure of the data space $X$ and …
Error decomposition
Dealing with the error analysis, the following regularized function is introduced as a 'bridge function':
$$f_\lambda = \arg\min_{f \in \mathcal{H}} \big\{ \mathcal{E}(f) + \lambda\, \Omega(f) \big\},$$
and we denote the approximation error by
$$\mathcal{D}(\lambda) = \mathcal{E}(f_\lambda) - \mathcal{E}(f_{\tau,\rho}) + \lambda\, \Omega(f_\lambda).$$
The decay of $\mathcal{D}(\lambda)$ as $\lambda \to 0$ measures the approximation ability of the function space $\mathcal{H}$ to $f_{\tau,\rho}$; more details can be found in [27]. To this end, we assume that there exist constants $C_0 > 0$ and $0 < \beta \le 1$ such that
$$\mathcal{D}(\lambda) \le C_0 \lambda^{\beta} \quad \text{for all } \lambda > 0. \tag{3.1}$$
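For orientation, the bridge function enters the analysis through a standard decomposition of the excess risk; under the notation reconstructed above, it reads
$$\mathcal{E}(f_{\mathbf z}) - \mathcal{E}(f_{\tau,\rho})
\le \underbrace{\big[\mathcal{E}(f_{\mathbf z}) - \mathcal{E}_{\mathbf z}(f_{\mathbf z})\big]
+ \big[\mathcal{E}_{\mathbf z}(f_\lambda) - \mathcal{E}(f_\lambda)\big]}_{\text{sample error}}
+ \underbrace{\mathcal{D}(\lambda)}_{\text{approximation error}},$$
where the inequality follows by adding and subtracting the empirical risks and using the defining property $\mathcal{E}_{\mathbf z}(f_{\mathbf z}) + \lambda\,\Omega(f_{\mathbf z}) \le \mathcal{E}_{\mathbf z}(f_\lambda) + \lambda\,\Omega(f_\lambda)$ of the minimizer $f_{\mathbf z}$, together with $\Omega(f_{\mathbf z}) \ge 0$.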
For any function $f$, note that the pinball loss $\psi_\tau$ is Lipschitz continuous with constant $\max\{\tau, 1 - \tau\} \le 1$; this leads to …
Sample error estimate
To deal with the sample error $\mathcal{S}(\mathbf z, \lambda)$, we further decompose it as
$$\mathcal{S}(\mathbf z, \lambda) = \mathcal{S}_1(\mathbf z, \lambda) + \mathcal{S}_2(\mathbf z, \lambda), \quad \text{where} \quad \mathcal{S}_1(\mathbf z, \lambda) = \mathcal{E}_{\mathbf z}(f_\lambda) - \mathcal{E}(f_\lambda) \ \ \text{and} \ \ \mathcal{S}_2(\mathbf z, \lambda) = \mathcal{E}(f_{\mathbf z}) - \mathcal{E}_{\mathbf z}(f_{\mathbf z}).$$
The first term $\mathcal{S}_1(\mathbf z, \lambda)$ involves only the fixed function $f_\lambda$ and can be easily handled using the one-sided Bernstein probability inequality. In connection with Lemma 1, the following proposition can be proven by an almost literal repetition of Lemma 4 of [27]. Proposition 2 Suppose that $|y| \le M$ almost surely. If $\rho$ has a $\tau$-quantile of …
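For reference, the one-sided Bernstein inequality takes the following form: for i.i.d. random variables $\xi_1, \dots, \xi_m$ with $|\xi_i - \mathbb{E}\xi_1| \le B$ almost surely, variance $\sigma^2$, and any $\varepsilon > 0$,
$$\mathrm{Prob}\left\{ \frac{1}{m}\sum_{i=1}^m \xi_i - \mathbb{E}\xi_1 \ge \varepsilon \right\}
\le \exp\left( - \frac{m\varepsilon^2}{2\big(\sigma^2 + \tfrac{1}{3}B\varepsilon\big)} \right).$$
In this setting it is applied with $\xi_i$ a function of the sample point $(x_i, y_i)$ built from the fixed bridge function $f_\lambda$, which is bounded since $|y| \le M$.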
Learning rates
Theorem 2 Assume that $X$ satisfies an interior cone condition, that $|y| \le M$ almost surely, and that the kernel satisfies the required regularity condition with some exponent. Suppose that $\rho$ has a $\tau$-quantile of $p$-average type $q$ and that (3.1) is valid. Then, for any $0 < \delta < 1$, with confidence $1 - \delta$, the stated learning rate holds, provided that the regularization parameter $\lambda$ is chosen appropriately … In addition, …
Optimization problem and numerical experiment
Although our investigation is mainly theoretical, it is useful to verify whether coefficient-based L1-regularization can improve the sparsity of the solution compared with RKHS regularization. In the following, we first show with synthetic data that coefficient-based L1-regularization can be helpful. To this end, we formulate the primal optimization problem of (1.4) for efficient numerical implementation. Writing $\alpha_j = \alpha_j^+ - \alpha_j^-$ with $\alpha_j^{\pm} \ge 0$, and applying the transformation technique employed by Schölkopf and Smola [22], we can …
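Under this splitting, the primal problem of (1.4) takes the following linear-programming form (our reconstruction of the standard transform, with slack variables $\xi_i^{\pm}$ for the positive and negative parts of the residuals):
$$\min_{\alpha^{\pm},\, \xi^{\pm} \,\ge\, 0}\ \lambda \sum_{j=1}^m \big(\alpha_j^+ + \alpha_j^-\big)
+ \frac{1}{m}\sum_{i=1}^m \big(\tau\,\xi_i^+ + (1-\tau)\,\xi_i^-\big)
\quad \text{s.t.} \quad
\sum_{j=1}^m \big(\alpha_j^+ - \alpha_j^-\big) K(x_i, x_j) + \xi_i^+ - \xi_i^- = y_i, \quad i = 1, \dots, m.$$
Both the objective and the constraints are linear, so standard LP solvers apply directly.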
Acknowledgments
The authors would like to thank the anonymous referees for their valuable comments and suggestions which have substantively improved this paper. This work is supported partially by 211 Youth Growth Project for the Southwestern University of Finance and Economics (Phase 3) under Grant No. 211QN2011028. This work is also supported by National Natural Science Foundation of China, Tian Yuan Special Foundation (No. 11226111).
References (32)
- X. Guo, D.-X. Zhou, An empirical feature-based learning algorithm producing sparse approximations, Appl. Comput. Harmon. Anal. (2012)
- L. Shi, Y.-L. Feng, D.-X. Zhou, Concentration estimates for learning with l1-regularizer and data dependent hypothesis spaces, Appl. Comput. Harmon. Anal. (2011)
- L.K. Bachrach et al., Bone mineral acquisition in healthy Asian, Hispanic, black and Caucasian youth: a longitudinal study, J. Clin. Endocrinol. Metab. (1999)
- T.J. Cole, P.J. Green, Smoothing reference centile curves: the LMS method and penalized likelihood, Statist. Med. (1992)
- E.J. Candès, J. Romberg, Sparsity and incoherence in compressive sampling, Inverse Problems (2007)
- E.J. Candès, T. Tao, The Dantzig selector: statistical estimation when p is much larger than n, Ann. Statist. (2007)
- B. Carl, I. Stephani, Entropy, Compactness and the Approximation of Operators (1990)
- F. Girosi, An Equivalence Between Sparse Approximation and Support Vector Machines, A.I. Memo 1606, MIT Artificial...
- T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (2001)
- P.J. Heagerty, M.S. Pepe, Semiparametric estimation of regression quantiles with application to standardizing weight for height and age in US children, J. R. Statist. Soc. Ser. C (1999)
- Error estimates for scattered data interpolation on spheres, Adv. Comput. Math.
- R. Koenker, K.F. Hallock, Quantile regression, J. Econom. Perspect. (2001)
- R. Koenker, Quantile Regression (2005)
- R. Koenker, G. Bassett, Regression quantiles, Econometrica (1978)
- R. Koenker, P. Ng, S. Portnoy, Quantile smoothing splines, Biometrika (1994)
- V. Koltchinskii, Sparsity in penalized empirical risk minimization, Ann. Inst. Henri Poincaré Probab. Statist. (2009)
Shao-Gao Lv received his Ph.D. degree in applied mathematics from the University of Science and Technology of China in 2011. He is currently an assistant professor at the School of Statistics, Southwestern University of Finance and Economics, China. His research interests include statistical learning and machine learning. He has published a number of papers in international journals and conferences.
Tie-Feng Ma received his Ph.D. in mathematical statistics from Beijing University of Technology, China, in July 2008. He is currently an associate professor at the Department of Statistics, Southwestern University of Finance and Economics, China. His research interests include linear mixed models, multivariate analysis, shrinkage estimation, and order-restricted inference.