
Knowledge-Based Systems

Volume 146, 15 April 2018, Pages 167-180

Minimum deviation distribution machine for large scale regression

https://doi.org/10.1016/j.knosys.2018.02.002

Abstract

In this paper, by introducing the statistics of the training data into support vector regression (SVR), we propose a minimum deviation distribution regression (MDR). Rather than merely minimizing the structural risk, MDR also minimizes both the regression deviation mean and the regression deviation variance, which enables it to deal with differently distributed boundary data and noise. The formulation of minimizing the first- and second-order statistics in MDR leads to a strongly convex quadratic programming problem (QPP). An efficient dual coordinate descent algorithm is adopted for small-sample problems, and an average stochastic gradient algorithm for large-scale ones. Both theoretical analysis and experimental results illustrate the efficiency and effectiveness of the proposed method.

Introduction

Regression analysis [1], [2], [3], [4], a powerful statistical tool for estimating the relationships among variables, has been studied extensively. In general, there are two main kinds of statistical regressors. One kind describes the sample distribution through statistics [5], [6], [7], [8], e.g., first-order and second-order statistics, necessary condition analysis, and correlations. These methods, such as linear regression [5], [6], [7], [9], [10], [11] and least squares regression [12], [13], [14], attempt to find the best fitting function as a regressor, and they are mainly concerned with empirical risk minimization.
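
For instance, ordinary least squares regression fits a linear function f(x) = (w · x) + b by minimizing the empirical squared error alone, without an explicit complexity term:

\[
\min_{w,\,b}\;\; \frac{1}{m}\sum_{i=1}^{m}\bigl(y_i - (w\cdot x_i) - b\bigr)^{2}.
\]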

The other kind of regressor focuses on structural risk minimization for prediction, where the regressor aims to produce good regression values for unseen samples. Two popular predictive regressors are ridge regression [15] and support vector regression (SVR) [16], [17]. SVR constructs an ε-insensitive tube as the bounds of the regressor together with a flatness-inducing regularization term. SVR has two advantages: sparsity of the solutions and the flexibility of generalizing easily to nonlinear regression. Recent research on SVR mainly concerns two aspects: one is to design efficient algorithms for solving a quadratic programming problem (QPP) whose size is twice the number of training samples, such as Chunking [18], sequential minimal optimization (SMO) [19], least squares support vector machine (LSSVM) [12], [13], LIBSVM [20], Pegasos [21] and LIBLINEAR [22]; the other is to provide comprehensive models for data with different statistical structures, such as modifications of the ν-support vector machine (Par-v-SVM) [23], twin support vector regression (TSVR) [24], ε-twin support vector regression (ε-TSVR) [10], and parametric-insensitive nonparallel support vector regression (PINSVR) [25], which were proposed to capture data structure and boundary information more accurately.
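
For context, the standard ε-SVR primal problem takes the following form, where the ε-insensitive tube tolerates deviations up to ε and the slack variables ξ_i, ξ_i* penalize samples outside the tube:

\[
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^{*}} \quad & \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\left(\xi_i+\xi_i^{*}\right) \\
\text{s.t.} \quad & y_i - (w\cdot\phi(x_i)) - b \le \varepsilon + \xi_i, \\
& (w\cdot\phi(x_i)) + b - y_i \le \varepsilon + \xi_i^{*}, \\
& \xi_i,\ \xi_i^{*} \ge 0, \qquad i = 1,\dots,m.
\end{aligned}
\]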

Inspired by the theoretical result [26] that the margin distribution is important to generalization performance in the formulation of SVM, Zhang and Zhou [8], [27] presented the large margin distribution machine (LDM), which introduces statistical information into SVM by seeking the support hyperplanes with the largest statistical margin, while keeping samples from the same class close to each other and samples from different classes far apart in the margin sense. More precisely, LDM maximizes the margin between the support hyperplanes together with the margin mean, and minimizes the margin variance. The idea of margin distribution has received a great deal of attention [28], [29], [30], [31], [32], [33], [34], [35].
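
Schematically, and with notation simplified from [8], [27], the LDM objective adds the margin mean \(\bar{\gamma}\) and margin variance \(\hat{\gamma}\) to the usual SVM terms:

\[
\begin{aligned}
\min_{w,\,\xi} \quad & \frac{1}{2}\,w^{\top}w + \lambda_1\hat{\gamma} - \lambda_2\bar{\gamma} + C\sum_{i=1}^{m}\xi_i \\
\text{s.t.} \quad & y_i\,w^{\top}\phi(x_i) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,m,
\end{aligned}
\]

where \(\gamma_i = y_i\,w^{\top}\phi(x_i)\) is the margin of sample i, and \(\bar{\gamma}\) and \(\hat{\gamma}\) denote the margin mean and margin variance, respectively.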

For regression, it is more reasonable to take the sample statistics into account in the deviation sense so as to characterize the sample distribution precisely. Therefore, in this paper, we propose a minimum deviation distribution regression (MDR) by introducing the statistics of the deviation into SVR. However, the statistical margin strategy used in LDM cannot be directly applied to our MDR, since the decision hyperplane in SVM or LDM is a separating function, whereas in MDR it is a fitting function. In contrast to the formulation of LDM, MDR minimizes the regression deviation mean together with the regression deviation variance. The regressor of MDR can be obtained by solving a QPP, as in SVR. To speed up the learning of our MDR, an efficient dual coordinate descent algorithm and an average stochastic gradient descent (ASGD) algorithm are constructed for small-scale and large-scale problems, respectively (an illustrative sketch is given after the contribution list below). The main contributions of this paper are as follows:

  • i) The regression deviation mean and the regression deviation variance are defined to present the first-order and second-order statistics in regression.

  • ii) MDR minimizes these two deviations within SVR, which also achieves the structural risk minimization principle.

  • iii) MDR is robust to differently distributed boundary data and noise, owing to the use of the regression deviation mean and the regression deviation variance.

  • iv) A dual coordinate descent algorithm and an average stochastic gradient algorithm are designed for solving small-scale and large-scale regression problems, respectively.

  • v) Experimental results on both artificial data sets and benchmark data sets demonstrate the effectiveness and efficiency of the proposed method.
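
To make the large-scale algorithm concrete, the following is a minimal sketch of averaged SGD applied to an MDR-style linear objective. It assumes the deviation of sample i is d_i = (w · x_i) − y_i and combines a squared deviation mean, a deviation variance term, and an ε-insensitive loss; the function name, hyperparameters and the exact objective are illustrative assumptions, not the paper's precise formulation.

import numpy as np

def asgd_mdr_sketch(X, y, lam1=1.0, lam2=1.0, C=1.0, eps=0.1,
                    lr=0.01, epochs=5, seed=0):
    # Illustrative objective (NOT the paper's exact formulation):
    #   0.5*||w||^2 + lam1*mean(d)^2 + lam2*var(d) + C*mean(max(0, |d_i| - eps)),
    # where d_i = w.x_i - y_i is the regression deviation of sample i.
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    w_avg = np.zeros(n)          # running average of iterates (the ASGD output)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(m):
            d_all = X @ w - y    # all deviations (a truly online method would track the mean incrementally)
            d = d_all[i]         # deviation of the sampled point
            grad = w                                              # gradient of 0.5*||w||^2
            grad = grad + lam1 * 2.0 * d_all.mean() * X[i]        # stochastic grad of mean(d)^2
            grad = grad + lam2 * 2.0 * (d - d_all.mean()) * X[i]  # stochastic grad of var(d)
            if abs(d) > eps:                                      # eps-insensitive loss part
                grad = grad + C * np.sign(d) * X[i]
            w = w - lr * grad
            t += 1
            w_avg = w_avg + (w - w_avg) / t                       # incremental averaging
    return w_avg

The deviation variance term, var(d) = mean(d²) − mean(d)², is what distinguishes this sketch from plain ε-insensitive regression; its stochastic gradient uses the centered deviation (d_i − mean(d)).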

This paper is organized as follows. Section 2 introduces the basic notation and gives a brief review of SVR. Section 3 presents the details of MDR, including its formulation, algorithms and properties. Experimental results are reported in Section 4. Section 5 concludes this paper.

Section snippets

Preliminaries

Suppose we are given a training set S = {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)} ⊂ X × ℝ, where x_i ∈ X is the input sample and y_i ∈ ℝ is the response value. The goal of regression is to find a regression function f(x) to predict the output of an input x. In SVR [6], the main goal is to find a regression function f(x) = (w̃ · ϕ(x)) + b, where w̃ is a weight vector, b is a bias, and ϕ(x) is a feature mapping of x induced by a kernel k(·, ·), i.e., k(x_i, x_j) = (ϕ(x_i) · ϕ(x_j)). In fact, by appending each sample with an
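
In the linear case, for example, the bias can be absorbed by augmenting each sample with a constant feature, a standard construction:

\[
\hat{x} = \begin{pmatrix} x \\ 1 \end{pmatrix}, \qquad w = \begin{pmatrix} \tilde{w} \\ b \end{pmatrix}, \qquad f(x) = (\tilde{w}\cdot x) + b = (w\cdot\hat{x}),
\]

so that the regressor becomes a homogeneous linear function of the augmented input.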

Minimum deviation distribution regression (MDR)

In this section, we first define two deviation distributions, and then give the primal optimization problem of MDR. Finally, we give two solution algorithms and the corresponding theoretical guarantees.

Experimental results

In this section, experiments were conducted to illustrate the effectiveness of our MDR compared with ε-TSVR [10], ε-SVR [5], [7] and LSSVR [45] on several data sets. The methods were implemented in MATLAB 7.0 running on a PC with an Intel(R) Core Duo i7 (2.70 GHz) and 32 GB RAM. ε-SVR was solved by LIBSVM and LIBLINEAR, and LSSVR was solved by LSSVMlab. The Gaussian kernel K(x_i, x_j) = exp(−‖x_i − x_j‖²/σ²) and the polynomial kernel K(x_i, x_j) = (x_i · x_j + 1)^d were employed for nonlinear regression. The values of
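
As a concrete illustration (a minimal sketch, not the authors' MATLAB code), these two kernel matrices can be computed as follows, with sigma and d as the tunable hyperparameters:

import numpy as np

def gaussian_kernel(Xa, Xb, sigma=1.0):
    # K(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2)
    sq = (np.sum(Xa**2, axis=1)[:, None]
          + np.sum(Xb**2, axis=1)[None, :]
          - 2.0 * Xa @ Xb.T)
    return np.exp(-sq / sigma**2)

def poly_kernel(Xa, Xb, d=2):
    # K(x_i, x_j) = (x_i . x_j + 1)^d
    return (Xa @ Xb.T + 1.0) ** d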

Conclusions

In this paper, by introducing the regression deviation mean and the regression deviation variance into regression, we have proposed a robust minimum deviation distribution machine for large-scale regression. Our MDR is not only robust to differently distributed data but also achieves structural risk minimization. In addition, two algorithms are proposed for linear and nonlinear MDR. Experiments on benchmark data sets and large scale data sets show that MDR is more robust than

Acknowledgments

The authors thank the editors and the anonymous reviewers, whose invaluable comments helped improve the presentation of this paper substantially. This work is supported by the National Natural Science Foundation of China (No. 11501310, 61603338, 11371365, 11426202), the Zhejiang Provincial Natural Science Foundation of China (No. LY15F030013, LQ17F030003, LY16A010020), Inner Mongolia Natural Science Foundation of China (No. 2015BS0606) and the Fundamental Research Funds for the Central

References (46)

  • Z. Wang et al.

    Twin support vector machine for clustering

    IEEE Trans. Neural Netw. Learn. Syst.

    (2015)
  • D. Anguita et al.

    A support vector machine with integer parameters

    Neurocomputing

    (2008)
  • L. Oneto et al.

    Learning resource-aware classifiers for mobile devices: from regularization to energy efficiency

    Neurocomputing

    (2015)
  • L. Oneto et al.

    Constraint-aware data analysis on mobile devices: an application to human activity recognition on smartphones

    Adaptive Mobile Comput.

    (2017)
  • G.X. Yuan et al.

    Recent advances of large-scale linear classification

    Proc. IEEE

    (2012)
  • N.R. Draper et al.

    Applied Regression Analysis

    (1998)
  • C.J.C. Burges

    A tutorial on support vector machines for pattern recognition

    Data Mining Knowl. Discov.

    (1998)
  • N. Cristianini et al.

    An Introduction to Support Vector Machines

    (2002)
  • C.W. Hsu et al.

    A comparison of methods for multiclass support vector machines

    IEEE Trans. Neural Netw.

    (2002)
  • N.Y. Deng et al.

    Support Vector Machines: Optimization Based Theory, Algorithms, and Extensions

    (2012)
  • V.N. Vapnik

    The Nature of Statistical Learning Theory

    (1995)
  • V.N. Vapnik

    Statistical Learning Theory

    (1998)
  • T. Zhang et al.

    Large margin distribution learning

    Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’14), New York, NY

    (2014)