Fast learning rates for the sparse quantile regression problem
Introduction
Quantile regression has emerged as a comprehensive approach for analyzing the impact of regressors on the conditional distribution of a response variable; see [12], [13] and others. In this paper, we consider quantile regression from a learning theory viewpoint. We begin with a supervised learning problem over a collection of observational data $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m$, drawn from an unknown distribution $\rho$ on $X \times Y$, where $X$ is an input space and $Y \subseteq \mathbb{R}$ contains the corresponding outputs for the regression problem. Given a loss function $\psi$, we hope that the risk functional $\mathcal{E}(f) = \int_{X \times Y} \psi(y - f(x))\, d\rho$ is small over all measurable functions on $X$. The least squares loss is the most commonly used, and its minimizer corresponds to the conditional mean function. Conditional mean regression describes the centrality of the conditional response distribution. However, sometimes one wants a good estimate such that a prescribed proportion of the responses falls below it. For example, in financial risk management an investor may need to estimate a lower bound on the changes in the value of a portfolio that holds with high probability, so as to take measures beforehand to guard against large asset volatility. In this case the mean function is no longer adequate, since it cannot provide a more complete description of the conditional response distribution. This problem can be solved by means of quantile regression, which has an equivalent relationship with the so-called pinball loss, defined for a quantile level $\tau \in (0, 1)$ by
$$\psi_\tau(u) = \begin{cases} \tau u, & u > 0, \\ (\tau - 1)\, u, & u \le 0. \end{cases} \tag{1.1}$$
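To make the definition concrete, here is a minimal Python sketch (ours, not from the paper) that implements the pinball loss and numerically confirms that the constant minimizing the empirical pinball risk is the sample $\tau$-quantile:

```python
import numpy as np

def pinball_loss(u, tau):
    """Pinball loss psi_tau(u): tau * u if u > 0, (tau - 1) * u otherwise."""
    return np.where(u > 0, tau * u, (tau - 1.0) * u)

rng = np.random.default_rng(0)
y = rng.standard_normal(10_000)   # a sample from the response distribution
tau = 0.75                        # quantile level of interest

# Empirical pinball risk of a constant predictor c, minimized over a grid.
grid = np.linspace(-3.0, 3.0, 2001)
risks = [pinball_loss(y - c, tau).mean() for c in grid]
c_star = grid[int(np.argmin(risks))]

# The minimizer agrees (up to grid resolution) with the sample 0.75-quantile.
print(c_star, np.quantile(y, tau))
```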
Let $\rho(\cdot \mid x)$ be the conditional distribution of $y$ given $x$, and, for any fixed constant $\tau \in (0, 1)$, define the set-valued function
$$F_\tau(x) = \big\{ t \in \mathbb{R} : \rho((-\infty, t] \mid x) \ge \tau \ \text{and}\ \rho([t, \infty) \mid x) \ge 1 - \tau \big\}.$$
If $\rho(\cdot \mid x)$ has finite support, it is well understood that $F_\tau(x)$ is a bounded and closed interval at any $x \in X$ [20]. We write $f_\tau^-(x) := \min F_\tau(x)$ and $f_\tau^+(x) := \max F_\tau(x)$, which implies that $F_\tau(x) = [f_\tau^-(x), f_\tau^+(x)]$. Moreover, it is easy to check that the interior of $F_\tau(x)$ is a $\rho(\cdot \mid x)$-zero set, namely, $\rho\big((f_\tau^-(x), f_\tau^+(x)) \mid x\big) = 0$. In this paper we assume that $F_\tau(x)$ consists of singletons, namely, there exists a function $f_{\tau,\rho}$, called the conditional $\tau$-quantile function, such that $F_\tau(x) = \{f_{\tau,\rho}(x)\}$ for almost every $x \in X$.
With the help of the pinball loss, one can easily verify that $f_{\tau,\rho}$ is a minimizer of the risk $\mathcal{E}(f)$ associated with $\psi_\tau$, in the sense that $f_{\tau,\rho}(x)$ minimizes $\int_Y \psi_\tau(y - t)\, d\rho(y \mid x)$ over $t \in \mathbb{R}$ for almost every $x \in X$. Quantile regression is an important statistical method for analyzing the impact of regressors on the conditional distribution of a response variable. Compared with conditional mean regression, one of its advantages is that it is more robust to large outliers. In recent years, quantile regression has been widely used in many practical applications, such as reference charts in medicine [2], [9] and economics [11]. For comprehensive reviews of quantile regression, refer to the articles by Koenker and Hallock [11], [30] and the well-written book by Koenker [12].
Throughout this paper, we assume that $\rho$ is supported on $X \times [-M, M]$ for some constant $M > 0$, and it follows that $|y| \le M$ almost surely. The special case $\tau = 1/2$ of (1.1) gives (up to the constant factor $1/2$) the absolute value function, that is, $\psi_{1/2}(u) = |u|/2$, and the minimizer becomes the median function, defined at each $x \in X$ as the median of $\rho(\cdot \mid x)$.
Usually $f_{\tau,\rho}$ cannot be computed directly, due to the unknown distribution $\rho$. Instead, one seeks to minimize the empirical error associated with the sample $\mathbf{z}$ and the pinball loss, which is given by $\mathcal{E}_{\mathbf z}(f) = \frac{1}{m}\sum_{i=1}^m \psi_\tau(y_i - f(x_i))$. Unfortunately, minimizing $\mathcal{E}_{\mathbf z}$ alone may lead to overfitting; that is, complex functions fit the training data well but fail to generalize to unseen data. A promising approach to avoid this is to minimize the following regularized risk:
$$f_{\mathbf z} = \arg\min_{f \in \mathcal{H}} \big\{ \mathcal{E}_{\mathbf z}(f) + \lambda\, \Omega(f) \big\},$$
where $\mathcal{H}$ is a pre-specified hypothesis function space and $\Omega$ is a penalty functional on $\mathcal{H}$.
Kernel methods have been widely used in many areas of machine learning and have achieved great success. Among them, kernel regression has drawn much attention, including quantile regression and least squares regression. The study has focused on the application of Mercer kernels and regularization in the associated reproducing kernel Hilbert space (RKHS). Let $K : X \times X \to \mathbb{R}$ be a bounded, symmetric, and positive semi-definite function. The RKHS $\mathcal{H}_K$ associated with the kernel $K$ is the completion of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$, with the inner product given by $\langle K_x, K_{x'} \rangle_K = K(x, x')$. With these preparations, the learning algorithm for quantile regression is given by the regularization scheme (see [23], [16], [20])
$$f_{\mathbf z} = \arg\min_{f \in \mathcal{H}_K} \Big\{ \frac{1}{m}\sum_{i=1}^m \psi_\tau(y_i - f(x_i)) + \lambda \|f\|_K^2 \Big\}, \tag{1.3}$$
where $\lambda > 0$ is a regularization parameter controlling the trade-off between the empirical error and the penalty. In practice, we need to choose an adaptive parameter $\lambda$, which usually depends on the sample; cross validation is probably the most commonly used technique for this purpose, but it imposes a heavy computational burden when the sample size is large. To this end, in Section 5 we will employ a standard training-validation approach to choose a suitable $\lambda$, and we also give the convergence rate of our proposed algorithm (1.4) below based on such an adaptive parameter.
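For intuition about how a scheme of the form (1.3) can be solved, the following is a minimal sketch under our own assumptions (Gaussian kernel, fixed step size, plain subgradient descent on the coefficient vector of the kernel expansion); it illustrates the scheme and is not the paper's implementation:

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kqr_rkhs(X, y, tau=0.5, lam=1e-3, lr=0.1, n_iter=2000):
    """Subgradient descent for (1.3): min_a mean(psi_tau(y - K a)) + lam * a' K a."""
    m = len(y)
    K = gaussian_gram(X)
    alpha = np.zeros(m)
    for _ in range(n_iter):
        u = y - K @ alpha
        g = np.where(u > 0, tau, tau - 1.0)   # subgradient of psi_tau at each residual
        grad = K @ (2 * lam * alpha - g / m)  # gradient of the regularized objective
        alpha -= lr * grad
    return alpha, K

# Toy usage: estimate the conditional 0.9-quantile of a noisy sine curve.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.3 * rng.standard_normal(200)
alpha, K = kqr_rkhs(X, y, tau=0.9)
f_hat = K @ alpha   # fitted values at the training points
```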
Note that the resulting function generated by (1.3) is given as a dense expansion in terms of the training patterns, which adds greatly to the computational cost. To solve large-data problems, a promising fact is that sparse solutions are often achieved in statistics and compressed sensing settings by imposing an L1-regularizer on the expansion coefficients [3], [18], [24]. The L1-norm penalty not only shrinks the fitted coefficients toward zero, but also forces some of them to be exactly zero when $\lambda$ is chosen large enough. Thus many irrelevant noise variables or useless data points can be removed. In order to introduce the L1-regularizer in a machine learning setup, we need to study the coefficient-based regularization scheme, which can be written as
$$f_{\mathbf z} = f_{\alpha^{\mathbf z}}, \qquad \alpha^{\mathbf z} = \arg\min_{\alpha \in \mathbb{R}^m} \Big\{ \frac{1}{m}\sum_{i=1}^m \psi_\tau\big(y_i - f_\alpha(x_i)\big) + \lambda \|\alpha\|_1 \Big\}, \tag{1.4}$$
where $f_\alpha(x) = \sum_{j=1}^m \alpha_j K(x, x_j)$ and $\|\alpha\|_1 = \sum_{j=1}^m |\alpha_j|$.
Algorithm (1.4) is a linear program, which can be solved efficiently by existing codes even for large-scale problems. For completeness, we also give the concrete optimization procedure in Section 6. This approach is similar to the linear programming regularization proposed by Smola et al. [23]. The main difference is that there a general convex function class substitutes for the set of functions $\{K_{x_i} : i = 1, \dots, m\}$ used here, which acts directly on the data points $x_i$. It is worth noting that our choice of basis functions is natural, since the solution of (1.3) can be found in the finite-dimensional space spanned by the set of functions $\{K_{x_i} : i = 1, \dots, m\}$ [16], [23]. We also find later that this choice ensures good generalization ability theoretically.
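As an illustration of this linear programming view, here is a sketch (our own, not the paper's Section 6 procedure) that casts (1.4) as an LP via the standard splitting $\alpha = \alpha^+ - \alpha^-$ with residual slack variables, and solves it with `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def kqr_l1_lp(K, y, tau=0.5, lam=1e-3):
    """Solve (1.4) as a linear program.

    Variables: [a_plus, a_minus, xi_plus, xi_minus], all >= 0, with
    alpha = a_plus - a_minus and residual y - K @ alpha = xi_plus - xi_minus.
    Objective: lam * sum(a_plus + a_minus)                        (L1 penalty)
             + (1/m) * sum(tau * xi_plus + (1 - tau) * xi_minus)  (pinball loss).
    """
    m = len(y)
    c = np.concatenate([
        lam * np.ones(m), lam * np.ones(m),
        tau / m * np.ones(m), (1 - tau) / m * np.ones(m),
    ])
    I = np.eye(m)
    A_eq = np.hstack([K, -K, I, -I])   # K(a+ - a-) + xi+ - xi- = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    a_plus, a_minus = res.x[:m], res.x[m:2 * m]
    return a_plus - a_minus            # the coefficient vector alpha

# Usage, given a Gram matrix K and responses y:
# alpha = kqr_l1_lp(K, y, tau=0.5, lam=0.01)
# print(np.mean(np.abs(alpha) < 1e-8))   # fraction of exactly-zero coefficients
```

At the optimum, $\xi_i^+$ and $\xi_i^-$ pick up the positive and negative parts of the residual, so the linear term $\tau\xi_i^+ + (1-\tau)\xi_i^-$ equals the pinball loss of the $i$-th residual.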
It is worth noting that some previous works have considered the same "L1 loss + L1 penalty" structure. For example, Koenker et al. [14] used the total variation of the derivative, $V(f') = \int |f''(x)|\, dx$, as the penalty, namely,
$$\min_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \psi_\tau(y_i - f(x_i)) + \lambda\, V(f'),$$
where $\mathcal{F}$ is a certain function space. It is shown there that the solution is a linear spline with knots at the data points if $\lambda$ is chosen appropriately. This partially motivates us to study the coefficient-based regularization scheme, considering that smoothing spline functions can be viewed as arising from a special kernel.
In the field of machine learning, coefficient-based regularization was first introduced in [6] to design the linear programming support vector machine. In [7], the sparsity of the estimated coefficients in least squares regression was discussed via a spectral decomposition technique. In terms of theoretical analysis, [27] derived learning rates by using a local polynomial reproduction formula from approximation theory; in that analysis, special attention was paid to quantile regression under rather general conditions. Ref. [17] provided a unified analytical framework for coefficient-based regularization with a strictly convex regularizer and an indefinite kernel. As far as we know, the previous works mentioned above did not consider the effect of the conditional distribution on the convergence rates. In this paper, we improve the corresponding learning rates sharply by making full use of the information on the conditional distribution and the so-called comparison theorem first proposed in [32]. In addition, we apply empirical covering numbers to measure the functional complexity, instead of the uniform covering numbers used in [27]. This ensures the finiteness of the entropy integral with respect to the hypothesis space, as shown in Section 4.
From the algorithmic point of view, one expects that (1.4) would be more stable and computationally efficient, and, most importantly, that it can approximate the target function well over the whole space $X$ as the sample size $m$ tends to infinity. Error analysis for the regularization scheme (1.4) aims at estimates of the excess generalization error $\mathcal{E}(f_{\mathbf z}) - \mathcal{E}(f_{\tau,\rho})$ in terms of the kernel $K$, the pinball loss, and the underlying measure $\rho$, through a proper choice of the regularization parameter $\lambda$. However, this only implies that $f_{\mathbf z}$ is close to $f_{\tau,\rho}$ in a weak sense. Recently, some stronger convergence rates have been derived by establishing so-called self-calibration inequalities (see [20], [25]).
To sum up, we list our main contributions as follows:
- Improved learning rates are obtained by employing so-called variance bounds;
- We establish stronger convergence rates by making use of the self-calibration inequalities. Meanwhile, our learning rates can also be derived by a simple data-dependent parameter selection method, which does not require prior knowledge of the target function;
- We implement a simulation experiment and a real data application to demonstrate the usefulness of the new method. In particular, the sparsity of the resulting solutions is highlighted in Section 6.
Section snippets
Definitions and main results
In this section we first introduce some basic notation and assumptions related to the data space and the underlying distribution, which are required throughout this text. Then we present two useful lemmas, which play an important role in our analysis of the learning rates of (1.4). Finally, we present and discuss our learning rates. Our analysis employs some mathematical techniques from multivariate approximation [10], and depends on two key assumptions on the structure of the data space $X$ and …
Error decomposition
Dealing with the error analysis, the following regularized function is introduced as a 'bridge function':
$$f_\lambda = \arg\min_{f \in \mathcal{H}} \big\{ \mathcal{E}(f) + \lambda\, \Omega(f) \big\},$$
and we denote the approximation error by
$$\mathcal{D}(\lambda) = \mathcal{E}(f_\lambda) - \mathcal{E}(f_{\tau,\rho}) + \lambda\, \Omega(f_\lambda).$$
The decay of $\mathcal{D}(\lambda)$ as $\lambda \to 0$ measures the approximation ability of the function space $\mathcal{H}$ to $f_{\tau,\rho}$; more details can be found in [27]. To this end, we assume that there exist constants $C_0 > 0$ and $0 < \beta \le 1$ such that
$$\mathcal{D}(\lambda) \le C_0 \lambda^{\beta} \quad \text{for all } \lambda > 0. \tag{3.1}$$
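For orientation, the bridge function enters the analysis through a standard decomposition of the excess risk; under the notation reconstructed above, it reads
$$\mathcal{E}(f_{\mathbf z}) - \mathcal{E}(f_{\tau,\rho})
\le \underbrace{\big[\mathcal{E}(f_{\mathbf z}) - \mathcal{E}_{\mathbf z}(f_{\mathbf z})\big]
+ \big[\mathcal{E}_{\mathbf z}(f_\lambda) - \mathcal{E}(f_\lambda)\big]}_{\text{sample error}}
+ \underbrace{\mathcal{D}(\lambda)}_{\text{approximation error}},$$
where the inequality follows by adding and subtracting the empirical risks and using the defining property $\mathcal{E}_{\mathbf z}(f_{\mathbf z}) + \lambda\,\Omega(f_{\mathbf z}) \le \mathcal{E}_{\mathbf z}(f_\lambda) + \lambda\,\Omega(f_\lambda)$ of the minimizer $f_{\mathbf z}$, together with $\Omega(f_{\mathbf z}) \ge 0$.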
For any function $f$, note that the pinball loss $\psi_\tau$ is Lipschitz continuous with constant $\max\{\tau, 1 - \tau\} \le 1$; this leads to …
Sample error estimate
To deal with the sample error $\mathcal{S}(\mathbf z, \lambda)$, we further decompose it as
$$\mathcal{S}(\mathbf z, \lambda) = \mathcal{S}_1(\mathbf z, \lambda) + \mathcal{S}_2(\mathbf z, \lambda), \quad \text{where} \quad \mathcal{S}_1(\mathbf z, \lambda) = \mathcal{E}_{\mathbf z}(f_\lambda) - \mathcal{E}(f_\lambda) \ \ \text{and} \ \ \mathcal{S}_2(\mathbf z, \lambda) = \mathcal{E}(f_{\mathbf z}) - \mathcal{E}_{\mathbf z}(f_{\mathbf z}).$$
The first term $\mathcal{S}_1(\mathbf z, \lambda)$ involves only the fixed function $f_\lambda$ and can be easily handled using the one-sided Bernstein probability inequality. In connection with Lemma 1, the following proposition can be proven by an almost literal repetition of Lemma 4 of [27]. Proposition 2 Suppose that $|y| \le M$ almost surely. If $\rho$ has a $\tau$-quantile of …
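For reference, the one-sided Bernstein inequality takes the following form: for i.i.d. random variables $\xi_1, \dots, \xi_m$ with $|\xi_i - \mathbb{E}\xi_1| \le B$ almost surely, variance $\sigma^2$, and any $\varepsilon > 0$,
$$\mathrm{Prob}\left\{ \frac{1}{m}\sum_{i=1}^m \xi_i - \mathbb{E}\xi_1 \ge \varepsilon \right\}
\le \exp\left( - \frac{m\varepsilon^2}{2\big(\sigma^2 + \tfrac{1}{3}B\varepsilon\big)} \right).$$
In this setting it is applied with $\xi_i$ a function of the sample point $(x_i, y_i)$ built from the fixed bridge function $f_\lambda$, which is bounded since $|y| \le M$.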
Learning rates
Theorem 2 Assume that $X$ satisfies an interior cone condition, that $|y| \le M$ almost surely, and that the kernel satisfies the required regularity condition with some exponent. Suppose that $\rho$ has a $\tau$-quantile of $p$-average type $q$ and that (3.1) is valid. Then, for any $0 < \delta < 1$, with confidence $1 - \delta$, the stated learning rate holds, provided that the regularization parameter $\lambda$ is chosen appropriately … In addition, …
Optimization problem and numerical experiment
Although our investigation is mainly theoretical, it is useful to verify whether coefficient-based L1-regularization can improve the sparsity of the solution compared with RKHS regularization. In the following, we first show with synthetic data that coefficient-based L1-regularization can be helpful. To this end, we formulate the primal optimization problem of (1.4) for efficient numerical implementation. Writing $\alpha_j = \alpha_j^+ - \alpha_j^-$ with $\alpha_j^{\pm} \ge 0$, and applying the transformation technique employed by Schölkopf and Smola [22], we can …
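Under this splitting, the primal problem of (1.4) takes the following linear-programming form (our reconstruction of the standard transform, with slack variables $\xi_i^{\pm}$ for the positive and negative parts of the residuals):
$$\min_{\alpha^{\pm},\, \xi^{\pm} \,\ge\, 0}\ \lambda \sum_{j=1}^m \big(\alpha_j^+ + \alpha_j^-\big)
+ \frac{1}{m}\sum_{i=1}^m \big(\tau\,\xi_i^+ + (1-\tau)\,\xi_i^-\big)
\quad \text{s.t.} \quad
\sum_{j=1}^m \big(\alpha_j^+ - \alpha_j^-\big) K(x_i, x_j) + \xi_i^+ - \xi_i^- = y_i, \quad i = 1, \dots, m.$$
Both the objective and the constraints are linear, so standard LP solvers apply directly.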
Acknowledgments
The authors would like to thank the anonymous referees for their valuable comments and suggestions which have substantively improved this paper. This work is supported partially by 211 Youth Growth Project for the Southwestern University of Finance and Economics (Phase 3) under Grant No. 211QN2011028. This work is also supported by National Natural Science Foundation of China, Tian Yuan Special Foundation (No. 11226111).
References (32)
- X. Guo, D.-X. Zhou, An empirical feature-based learning algorithm producing sparse approximations, Appl. Comput. Harmon. Anal. (2012)
- L. Shi, Y.-L. Feng, D.-X. Zhou, Concentration estimates for learning with l1-regularizer and data dependent hypothesis spaces, Appl. Comput. Harmon. Anal. (2011)
- L.K. Bachrach et al., Bone mineral acquisition in healthy Asian, Hispanic, black and Caucasian youth: a longitudinal study, J. Clin. Endocrinol. Metab. (1999)
- T.J. Cole, P.J. Green, Smoothing reference centile curves: the LMS method and penalized likelihood, Statist. Med. (1992)
- E.J. Candès, J. Romberg, Sparsity and incoherence in compressive sampling, Inverse Problems (2007)
- E.J. Candès, T. Tao, The Dantzig selector: statistical estimation when p is much larger than n, Ann. Statist. (2007)
- B. Carl, I. Stephani, Entropy, Compactness and the Approximation of Operators (1990)
- F. Girosi, An Equivalence Between Sparse Approximation and Support Vector Machines, A.I. Memo 1606, MIT Artificial...
- T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (2001)
- P.J. Heagerty, M.S. Pepe, Semiparametric estimation of regression quantiles with application to standardizing weight for height and age in US children, J. R. Statist. Soc. Ser. C (1999)
- Error estimates for scattered data interpolation on spheres, Adv. Comput. Math.
- R. Koenker, K.F. Hallock, Quantile regression, J. Econom. Perspect. (2001)
- R. Koenker, Quantile Regression (2005)
- R. Koenker, G. Bassett, Regression quantiles, Econometrica (1978)
- R. Koenker, P. Ng, S. Portnoy, Quantile smoothing splines, Biometrika (1994)
- V. Koltchinskii, Sparsity in penalized empirical risk minimization, Ann. Inst. Henri Poincaré Probab. Statist. (2009)
Shao-Gao Lv received his Ph.D. degree in applied mathematics from the University of Science and Technology of China in 2011. He is currently an assistant professor at the School of Statistics, Southwestern University of Finance and Economics, China. His research interests include statistical learning and machine learning. He has published a number of papers in international journals and conferences.
Tie-Feng Ma received his Ph.D. in mathematical statistics from Beijing University of Technology, China, in July 2008. He is currently an associate professor at the Department of Statistics, Southwestern University of Finance and Economics, China. His research interests include linear mixed models, multivariate analysis, shrinkage estimation, and order-restricted inference.