
Knowledge-Based Systems

Volume 146, 15 April 2018, Pages 167-180

Minimum deviation distribution machine for large scale regression

https://doi.org/10.1016/j.knosys.2018.02.002

Abstract

In this paper, by introducing the statistics of the training data into support vector regression (SVR), we propose a minimum deviation distribution regression (MDR). Rather than merely minimizing the structural risk, MDR also minimizes both the regression deviation mean and the regression deviation variance, which enables it to deal with differently distributed boundary data and noise. The formulation of minimizing the first- and second-order statistics in MDR leads to a strongly convex quadratic programming problem (QPP). An efficient dual coordinate descent algorithm is adopted for small-sample problems, and an average stochastic gradient algorithm for large-scale ones. Both theoretical analysis and experimental results illustrate the efficiency and effectiveness of the proposed method.

Introduction

Regression analysis [1], [2], [3], [4], a powerful statistical tool for estimating the relationships among variables, has been studied extensively. In general, there are two main kinds of statistical regressors. One kind describes the sample distribution through statistics [5], [6], [7], [8], e.g., first-order and second-order statistics, necessary condition analysis, and correlations. These methods, such as linear regression [5], [6], [7], [9], [10], [11] and least squares regression [12], [13], [14], attempt to find the best fitting function as a regressor, and they are mainly concerned with empirical risk minimization.
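
For instance, ordinary least squares regression fits a linear function f(x) = (w · x) + b by minimizing the empirical squared error alone, without an explicit complexity term:

\[
\min_{w,\,b}\;\; \frac{1}{m}\sum_{i=1}^{m}\bigl(y_i - (w\cdot x_i) - b\bigr)^{2}.
\]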

The other kind of regressor focuses on structural risk minimization for prediction, where the regressor aims to produce good regression values for unseen samples. Two popular predictive regressors are ridge regression [15] and support vector regression (SVR) [16], [17]. SVR constructs an ε-insensitive tube as the bounds of the regressor together with a flatness-inducing regularization term. SVR has two advantages: sparsity of the solutions and the flexibility of generalizing easily to nonlinear regression. Recent research on SVR mainly concerns two aspects: one is to design efficient algorithms for solving a quadratic programming problem (QPP) whose size is twice the number of training samples, such as Chunking [18], sequential minimal optimization (SMO) [19], least squares support vector machine (LSSVM) [12], [13], LIBSVM [20], Pegasos [21] and LIBLINEAR [22]; the other is to provide comprehensive models for data with different statistical structures, such as modifications of the ν-support vector machine (Par-v-SVM) [23], twin support vector regression (TSVR) [24], ε-twin support vector regression (ε-TSVR) [10], and parametric-insensitive nonparallel support vector regression (PINSVR) [25], which were proposed to capture data structure and boundary information more accurately.
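
For context, the standard ε-SVR primal problem takes the following form, where the ε-insensitive tube tolerates deviations up to ε and the slack variables ξ_i, ξ_i* penalize samples outside the tube:

\[
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^{*}} \quad & \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\left(\xi_i+\xi_i^{*}\right) \\
\text{s.t.} \quad & y_i - (w\cdot\phi(x_i)) - b \le \varepsilon + \xi_i, \\
& (w\cdot\phi(x_i)) + b - y_i \le \varepsilon + \xi_i^{*}, \\
& \xi_i,\ \xi_i^{*} \ge 0, \qquad i = 1,\dots,m.
\end{aligned}
\]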

Inspired by the theoretical result [26] that the margin distribution is important to generalization performance in the formulation of SVM, Zhang and Zhou [8], [27] presented the large margin distribution machine (LDM), which introduces statistical information into SVM by seeking the support hyperplanes with the largest statistical margin, while keeping samples from the same class close to each other and samples from different classes far apart in the margin sense. More precisely, LDM maximizes the margin between the support hyperplanes together with the margin mean, and minimizes the margin variance. The idea of margin distribution has received a great deal of attention [28], [29], [30], [31], [32], [33], [34], [35].
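
Schematically, and with notation simplified from [8], [27], the LDM objective adds the margin mean \(\bar{\gamma}\) and margin variance \(\hat{\gamma}\) to the usual SVM terms:

\[
\begin{aligned}
\min_{w,\,\xi} \quad & \frac{1}{2}\,w^{\top}w + \lambda_1\hat{\gamma} - \lambda_2\bar{\gamma} + C\sum_{i=1}^{m}\xi_i \\
\text{s.t.} \quad & y_i\,w^{\top}\phi(x_i) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,m,
\end{aligned}
\]

where \(\gamma_i = y_i\,w^{\top}\phi(x_i)\) is the margin of sample i, and \(\bar{\gamma}\) and \(\hat{\gamma}\) denote the margin mean and margin variance, respectively.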

For regression, it is more reasonable to take the sample statistics into account in the deviation sense so as to characterize the sample distribution precisely. Therefore, in this paper, we propose a minimum deviation distribution regression (MDR) by introducing the statistics of the deviation into SVR. However, the statistical margin strategy used in LDM cannot be directly applied to our MDR, since the decision hyperplane in SVM or LDM is a separating function, whereas in MDR it is a fitting function. In contrast to the formulation of LDM, MDR minimizes the regression deviation mean together with the regression deviation variance. The regressor of MDR can be obtained by solving a QPP, as in SVR. To speed up the learning of our MDR, an efficient dual coordinate descent algorithm and an average stochastic gradient descent (ASGD) algorithm are constructed for small-scale and large-scale problems, respectively (an illustrative sketch is given after the contribution list below). The main contributions of this paper are as follows:

  • i) The regression deviation mean and the regression deviation variance are defined to present the first-order and second-order statistics in regression.

  • ii) MDR minimizes these two deviations within SVR, which also achieves the structural risk minimization principle.

  • iii) MDR is robust to differently distributed boundary data and noise, owing to the use of the regression deviation mean and the regression deviation variance.

  • iv) A dual coordinate descent algorithm and an average stochastic gradient algorithm are designed for solving small-scale and large-scale regression problems, respectively.

  • v) Experimental results on both artificial data sets and benchmark data sets demonstrate the effectiveness and efficiency of the proposed method.
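
To make the large-scale algorithm concrete, the following is a minimal sketch of averaged SGD applied to an MDR-style linear objective. It assumes the deviation of sample i is d_i = (w · x_i) − y_i and combines a squared deviation mean, a deviation variance term, and an ε-insensitive loss; the function name, hyperparameters and the exact objective are illustrative assumptions, not the paper's precise formulation.

import numpy as np

def asgd_mdr_sketch(X, y, lam1=1.0, lam2=1.0, C=1.0, eps=0.1,
                    lr=0.01, epochs=5, seed=0):
    # Illustrative objective (NOT the paper's exact formulation):
    #   0.5*||w||^2 + lam1*mean(d)^2 + lam2*var(d) + C*mean(max(0, |d_i| - eps)),
    # where d_i = w.x_i - y_i is the regression deviation of sample i.
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    w_avg = np.zeros(n)          # running average of iterates (the ASGD output)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(m):
            d_all = X @ w - y    # all deviations (a truly online method would track the mean incrementally)
            d = d_all[i]         # deviation of the sampled point
            grad = w                                              # gradient of 0.5*||w||^2
            grad = grad + lam1 * 2.0 * d_all.mean() * X[i]        # stochastic grad of mean(d)^2
            grad = grad + lam2 * 2.0 * (d - d_all.mean()) * X[i]  # stochastic grad of var(d)
            if abs(d) > eps:                                      # eps-insensitive loss part
                grad = grad + C * np.sign(d) * X[i]
            w = w - lr * grad
            t += 1
            w_avg = w_avg + (w - w_avg) / t                       # incremental averaging
    return w_avg

The deviation variance term, var(d) = mean(d²) − mean(d)², is what distinguishes this sketch from plain ε-insensitive regression; its stochastic gradient uses the centered deviation (d_i − mean(d)).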

This paper is organized as follows. Section 2 introduces the basic notation and gives a brief review of SVR. Section 3 presents the details of MDR, including its formulation, algorithms and properties. Experimental results are reported in Section 4. Section 5 concludes this paper.

Section snippets

Preliminaries

Suppose we are given a training set S = {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)} ⊂ X × ℝ, where x_i ∈ X is the input sample and y_i ∈ ℝ is the response value. The goal of regression is to find a regression function f(x) to predict the output of an input x. In SVR [6], the main goal is to find a regression function f(x) = (w̃ · ϕ(x)) + b, where w̃ is a weight vector, b is a bias, and ϕ(x) is a feature mapping of x induced by a kernel k(·, ·), i.e., k(x_i, x_j) = (ϕ(x_i) · ϕ(x_j)). In fact, by appending each sample with an
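
In the linear case, for example, the bias can be absorbed by augmenting each sample with a constant feature, a standard construction:

\[
\hat{x} = \begin{pmatrix} x \\ 1 \end{pmatrix}, \qquad w = \begin{pmatrix} \tilde{w} \\ b \end{pmatrix}, \qquad f(x) = (\tilde{w}\cdot x) + b = (w\cdot\hat{x}),
\]

so that the regressor becomes a homogeneous linear function of the augmented input.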

Minimum deviation distribution regression (MDR)

In this section, we first define two deviation distributions, and then give the primal optimization problem of MDR. Finally, we give two solution algorithms and the corresponding theoretical guarantees.

Experimental results

In this section, experiments were conducted to illustrate the effectiveness of our MDR compared with ε-TSVR [10], ε-SVR [5], [7] and LSSVR [45] on several data sets. The methods were implemented in MATLAB 7.0 running on a PC with an Intel(R) Core Duo i7 (2.70 GHz) and 32 GB RAM. ε-SVR was solved by LIBSVM and LIBLINEAR, and LSSVR was solved by LSSVMlab. The Gaussian kernel K(x_i, x_j) = exp(−‖x_i − x_j‖²/σ²) and the polynomial kernel K(x_i, x_j) = (x_i · x_j + 1)^d were employed for nonlinear regression. The values of
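
As a concrete illustration (a minimal sketch, not the authors' MATLAB code), these two kernel matrices can be computed as follows, with sigma and d as the tunable hyperparameters:

import numpy as np

def gaussian_kernel(Xa, Xb, sigma=1.0):
    # K(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2)
    sq = (np.sum(Xa**2, axis=1)[:, None]
          + np.sum(Xb**2, axis=1)[None, :]
          - 2.0 * Xa @ Xb.T)
    return np.exp(-sq / sigma**2)

def poly_kernel(Xa, Xb, d=2):
    # K(x_i, x_j) = (x_i . x_j + 1)^d
    return (Xa @ Xb.T + 1.0) ** d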

Conclusions

In this paper, by introducing the regression deviation mean and the regression deviation variance into regression, we have proposed a robust minimum deviation distribution machine for large-scale regression. Our MDR is not only robust to differently distributed data but also achieves structural risk minimization. In addition, two algorithms are proposed for linear and nonlinear MDR. Experiments on benchmark data sets and large scale data sets show that MDR is more robust than

Acknowledgments

The authors thank the editors and the anonymous reviewers, whose invaluable comments helped improve the presentation of this paper substantially. This work is supported by the National Natural Science Foundation of China (No. 11501310, 61603338, 11371365, 11426202), the Zhejiang Provincial Natural Science Foundation of China (No. LY15F030013, LQ17F030003, LY16A010020), Inner Mongolia Natural Science Foundation of China (No. 2015BS0606) and the Fundamental Research Funds for the Central

References (46)

  • Z. Wang et al.

    Twin support vector machine for clustering

    IEEE Trans. Neural Netw. Learn. Syst.

    (2015)
  • D. Anguita et al.

    A support vector machine with integer parameters

    Neurocomputing

    (2008)
  • L. Oneto et al.

    Learning resource-aware classifiers for mobile devices: from regularization to energy efficiency

    Neurocomputing

    (2015)
  • L. Oneto et al.

    Constraint-aware data analysis on mobile devices: an application to human activity recognition on smartphones

    Adaptive Mobile Comput.

    (2017)
  • G.X. Yuan et al.

    Recent advances of large-scale linear classification

    Proc. IEEE

    (2012)
  • N.R. Draper et al.

    Applied Regression Analysis

    (1998)
  • C.J.C. Burges

    A tutorial on support vector machines for pattern recognition

    Data Mining Knowl. Discov.

    (1998)
  • N. Cristianini et al.

    An Introduction to Support Vector Machines

    (2002)
  • C.W. Hsu et al.

    A comparison of methods for multiclass support vector machines

    IEEE Trans. Neural Netw.

    (2002)
  • N.Y. Deng et al.

    Support Vector Machines: Optimization Based Theory, Algorithms, and Extensions

    (2012)
  • V.N. Vapnik

    The Nature of Statistical Learning Theory

    (1995)
  • V.N. Vapnik

    Statistical Learning Theory

    (1998)
  • T. Zhang et al.

    Large margin distribution learning

    Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’14), New York, NY

    (2014)