1 Introduction

As a powerful tool in machine learning, support vector machines (SVMs) have gained a great deal of attention in a wide variety of fields [1,2,3,4,5]. The classical support vector classifier (SVC) maximizes the margin between two parallel hyperplanes, which leads to solving a convex quadratic programming problem (QPP). Furthermore, non-parallel hyperplane classifiers such as the generalized eigenvalue proximal support vector machine (GEPSVM) and the twin support vector machine (TWSVM) have been proposed in [4, 5]. TWSVM seeks two non-parallel proximal hyperplanes such that each hyperplane is closer to one class and as far as possible from the other. As an efficient generalization of the classical SVC, TWSVM solves two small QPPs instead of a single large one, and it has been studied extensively [6,7,8,9,10,11,12,13,14]. This paper proposes a novel large margin proximal non-parallel support vector machine for binary classification (PNSVM). The main contributions of PNSVM are:

  • PNSVM minimizes the structural risk by imposing a regularization term.

  • Each PNSVM hyperplane is proximal to both classes and maximizes the corresponding margin, whereas each TWSVM hyperplane is proximal to one class and as far as possible from the other class.

  • PNSVM can be solved efficiently with the dual coordinate descent method [15].

2 Related Work

Benefiting from the excellent generalization performance of twin support vector machines, approaches that construct non-parallel hyperplanes have received extensive attention [6, 8, 9, 11,12,13,14, 16]. Shao et al. [6] present a variant of GEPSVM based on a difference measure, which appears to be superior in both classification accuracy and computation time. For imbalanced data classification, Shao et al. [8] propose an efficient weighted Lagrangian twin support vector machine (WLTSVM), which is robust to outliers and overcomes the bias phenomenon of the original TWSVM. Liu et al. [9] present a support vector machine for large-scale regression based on minimizing the deviation distribution. Tian et al. [11] propose a novel nonparallel SVM for binary classification, which implements structural risk minimization and is suitable for large-scale problems. Shao et al. [16] present a sparse \(L_q\)-norm least squares support vector machine, in which feature selection and prediction are performed simultaneously.

3 PNSVM

3.1 Linear PNSVM

Linear PNSVM finds two non-parallel hyperplanes \(f_1(\varvec{x})=\varvec{w}_1^T\varvec{x}+b_1=0\) and \(f_2(\varvec{x})=\varvec{w}_2^T\varvec{x}+b_2=0\) by solving the following two problems:

$$\begin{aligned} \min _{\varvec{w}_1,b_1,\varvec{p},\varvec{q}_1,\rho _1}~~&\frac{1}{2}\varvec{p}^T\varvec{p}+\frac{1}{2}c_1(\Vert \varvec{w}_1\Vert ^2+b_1^2)+c_2\varvec{e}_2^T\varvec{q}_1+c_3\rho _1 \\ \text {s.t.}~~&\varvec{Aw}_1+\varvec{e}_1b_1=\varvec{p},\\ &-(\varvec{Bw}_1+\varvec{e}_2b_1)\ge \varvec{e}_2-\varvec{q}_1,\\ &-(\varvec{Bw}_1+\varvec{e}_2b_1)\le \rho _1\varvec{e}_2-\varvec{q}_1,\\ &\varvec{q}_1\ge 0,~\rho _1\ge 1 \end{aligned}$$
(1)

and

$$\begin{aligned} \min _{\varvec{w}_2,b_2,\varvec{\tilde{p}},\varvec{q}_2,\rho _2}~~&\frac{1}{2}\varvec{\tilde{p}}^T\varvec{\tilde{p}}+\frac{1}{2}c_4(\Vert \varvec{w}_2\Vert ^2+b_2^2)+c_5\varvec{e}_1^T\varvec{q}_2+c_6\rho _2 \\ \text {s.t.}~~&\varvec{Bw}_2+\varvec{e}_2b_2=\varvec{\tilde{p}},\\ &\varvec{Aw}_2+\varvec{e}_1b_2\ge \varvec{e}_1-\varvec{q}_2,\\ &\varvec{Aw}_2+\varvec{e}_1b_2\le \rho _2\varvec{e}_1-\varvec{q}_2,\\ &\varvec{q}_2\ge 0,~\rho _2\ge 1 \end{aligned}$$
(2)

where \(c_i, i=1,2,\cdots ,6\) are parameters, \(\varvec{A}\in \mathbf {R}^{m_1\times n}\) and \(\varvec{B}\in \mathbf {R}^{m_2\times n}\) are the matrices whose rows are the training points of the positive and negative classes, respectively, and \(\varvec{e}_1\in \mathbf {R}^{m_1}\) and \(\varvec{e}_2\in \mathbf {R}^{m_2}\) are vectors of ones.

The formulation (1) can be understood as follows. The first term in (1) is the sum of squared distances from \(f_1(\varvec{x})=\varvec{w}_1^T\varvec{x}+b_1=0\) to the points of the positive class, so its minimization keeps \(f_1(\varvec{x})=0\) close to the positive class. The minimization of the second term, \(\frac{1}{2}c_1(\Vert \varvec{w}_1\Vert ^2+b_1^2)\), implements the structural risk minimization principle. The constraints on the negative class require the hyperplane \(f_1(\varvec{x})=\varvec{w}_1^T\varvec{x}+b_1\) to be at a distance of at least 1 and at most \(\rho _1\) from the points of the negative class. The third term in (1) penalizes mis-classification, and the last term keeps the points of the negative class from lying too far away from the hyperplane. A similar interpretation applies to the formulation (2).
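Written per sample, and consistently with the Lagrangian given below, the constraints of (1) on the \(j\)-th negative point \(\varvec{x}_j\) read

$$\begin{aligned} 1-(\varvec{q}_1)_j\le -(\varvec{w}_1^T\varvec{x}_j+b_1)\le \rho _1-(\varvec{q}_1)_j,\qquad (\varvec{q}_1)_j\ge 0, \end{aligned}$$

so that, up to the slack \((\varvec{q}_1)_j\), the negative points are kept at a (functional) margin between 1 and \(\rho _1\) from \(f_1(\varvec{x})=0\).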

The Lagrangian of the formulation (1) is given by

$$\begin{aligned}&\mathbf {L}(\varvec{w}_1,b_1,\varvec{p},\varvec{q}_1,\rho _1;\varvec{\alpha }_1,\varvec{\alpha }_2,\varvec{\alpha }_3,\varvec{\alpha }_4,\beta ) \\= & {} \frac{1}{2}c_1 (\Vert \varvec{w}_1\Vert ^2+b_1^2)+\frac{1}{2}\varvec{p}^T\varvec{p}+c_2\varvec{e}_2^T\varvec{q}_1+c_3\rho _1 +\varvec{\alpha }_1^T(\varvec{Aw}_1+\varvec{e}_1b_1-\varvec{p})\nonumber \\- & {} \varvec{\alpha }_2^T(-(\varvec{Bw}_1+\varvec{e}_2b_1)+\varvec{q}_1-\varvec{e}_2)-\varvec{\alpha }_3^T(\rho _1 \varvec{e}_2 + (\varvec{Bw}_1+\varvec{e}_2b_1)-\varvec{q}_1)\nonumber \\- & {} \varvec{\alpha }_4^T\varvec{q}_1-\beta (\rho _1-1) \end{aligned}$$

where \(\varvec{\alpha }_1\in \mathbf {R}^{m_1\times 1}, \varvec{\alpha }_2\in \mathbf {R}^{m_2\times 1},\varvec{\alpha }_3\in \mathbf {R}^{m_2\times 1},\varvec{\alpha }_4\in \mathbf {R}^{m_2\times 1}\) and \(\beta \in \mathbf {R}\) are the Lagrange multipliers. The optimality conditions for \(\varvec{w}_1,b_1,\varvec{p},\varvec{q}_1,\rho _1\) are given by

$$\begin{aligned} \nabla _{\varvec{w}_1}\mathbf {L}= & {} c_1\varvec{w}_1+\varvec{A}^T\varvec{\alpha }_1+\varvec{B}^T\varvec{\alpha }_2 - \varvec{B}^T\varvec{\alpha }_3 =0 \end{aligned}$$
(3)
$$\begin{aligned} \nabla _{b_1}\mathbf {L}= & {} c_1b_1+\varvec{e}_1^T\varvec{\alpha }_1+\varvec{e}_2^T\varvec{\alpha }_2-\varvec{e}_2^T\varvec{\alpha }_3=0\end{aligned}$$
(4)
$$\begin{aligned} \nabla _{\varvec{p}}\mathbf {L}= & {} \varvec{\alpha }_1-\varvec{p}=0\end{aligned}$$
(5)
$$\begin{aligned} \nabla _{\varvec{q}_1}\mathbf {L}= & {} c_2\varvec{e}_2-\varvec{\alpha }_2+\varvec{\alpha }_3-\varvec{\alpha }_4=0\end{aligned}$$
(6)
$$\begin{aligned} \nabla _{\rho _1}\mathbf {L}= & {} c_3-\varvec{e}_2^T \varvec{\alpha }_3 -\beta =0 \end{aligned}$$
(7)
$$\begin{aligned}&\varvec{\alpha }_2\ge 0,\varvec{\alpha }_3\ge 0,\varvec{\alpha }_4\ge 0, \beta \ge 0. \end{aligned}$$
(8)
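In particular, conditions (3), (4) and (5) express the primal variables in terms of the multipliers as

$$\begin{aligned} \varvec{w}_1=-\frac{1}{c_1}\left( \varvec{A}^T\varvec{\alpha }_1+\varvec{B}^T(\varvec{\alpha }_2-\varvec{\alpha }_3)\right) ,\quad b_1=-\frac{1}{c_1}\left( \varvec{e}_1^T\varvec{\alpha }_1+\varvec{e}_2^T(\varvec{\alpha }_2-\varvec{\alpha }_3)\right) ,\quad \varvec{p}=\varvec{\alpha }_1, \end{aligned}$$

and it is through these expressions that \((\varvec{w}_1,b_1)\) is recovered once the dual problem is solved.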

Then, substituting (3)–(7) into the Lagrangian, we obtain the dual problem of problem (1):

$$\begin{aligned} \max _{\varvec{\alpha }_1,\varvec{\alpha }_2,\varvec{\alpha }_3}~~&-\frac{1}{2c_1}\varvec{\pi }^T\varvec{\bar{H}}\varvec{\pi }+\varvec{e}_2^T(\varvec{\alpha }_2-\varvec{\alpha }_3) \\ \text {s.t.}~~&\varvec{\alpha }_2\ge 0,~\varvec{\alpha }_3\ge 0,~\varvec{\alpha }_2-\varvec{\alpha }_3\le c_2\varvec{e}_2,~\varvec{e}_2^T\varvec{\alpha }_3\le c_3, \end{aligned}$$
(9)

where \(\varvec{\pi }=(\varvec{\alpha }_1^T,\varvec{\alpha }_2^T,\varvec{\alpha }_3^T)^T\) and

$$\begin{aligned} \varvec{\bar{H}}= & {} \left[ \begin{array}{c c c} \varvec{AA}^T+c_1\varvec{I}&{}\varvec{AB}^T&{}-\varvec{AB}^T\\ \varvec{BA}^T&{}\varvec{BB}^T&{}-\varvec{BB}^T\\ -\varvec{BA}^T&{}-\varvec{BB}^T&{}\varvec{BB}^T \end{array} \right] + \left[ \begin{array}{c c c} \varvec{e}_1\varvec{e}_1^T&{}\varvec{e}_1\varvec{e}_2^T&{}-\varvec{e}_1\varvec{e}_2^T\\ \varvec{e}_2\varvec{e}_1^T&{}\varvec{e}_2\varvec{e}_2^T&{}-\varvec{e}_2\varvec{e}_2^T\\ -\varvec{e}_2\varvec{e}_1^T&{}-\varvec{e}_2\varvec{e}_2^T&{}\varvec{e}_2\varvec{e}_2^T \end{array} \right] . \end{aligned}$$
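Note that \(\varvec{\bar{H}}\) has the compact form \(\varvec{G}\varvec{G}^T+\varvec{E}\varvec{E}^T\) plus \(c_1\varvec{I}\) in its leading \(m_1\times m_1\) block, where \(\varvec{G}=[\varvec{A};\varvec{B};-\varvec{B}]\) and \(\varvec{E}=[\varvec{e}_1;\varvec{e}_2;-\varvec{e}_2]\) are row-wise stacks. As an illustration only, the following minimal NumPy sketch assembles \(\varvec{\bar{H}}\); the function and variable names are illustrative and not taken from the paper's Matlab implementation.

```python
import numpy as np

def build_H_bar(A, B, c1):
    """Assemble the matrix H_bar of the dual problem (9).

    A  : (m1, n) array whose rows are the positive-class points
    B  : (m2, n) array whose rows are the negative-class points
    c1 : regularization parameter of problem (1)
    """
    m1 = A.shape[0]
    m2 = B.shape[0]
    G = np.vstack([A, B, -B])                                    # [A; B; -B]
    E = np.concatenate([np.ones(m1), np.ones(m2), -np.ones(m2)])
    H = G @ G.T + np.outer(E, E)                                 # G G^T + E E^T
    H[:m1, :m1] += c1 * np.eye(m1)                               # c1*I in the leading block only
    return H
```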

Similarly, denoting by \(\varvec{\lambda }_1\in \mathbf {R}^{m_2\times 1},\varvec{\lambda }_2\in \mathbf {R}^{m_1\times 1},\varvec{\lambda }_3\in \mathbf {R}^{m_1\times 1}\) the corresponding multipliers of problem (2) and setting \(\varvec{\sigma }=(\varvec{\lambda }_1^T,\varvec{\lambda }_2^T,\varvec{\lambda }_3^T)^T\), the dual of problem (2) is

$$\begin{aligned} \max _{\varvec{\lambda }_1,\varvec{\lambda }_2,\varvec{\lambda }_3}~~&-\frac{1}{2c_4}\varvec{\sigma }^T\varvec{\tilde{H}}\varvec{\sigma }+\varvec{e}_1^T(\varvec{\lambda }_2-\varvec{\lambda }_3) \\ \text {s.t.}~~&\varvec{\lambda }_2\ge 0,~\varvec{\lambda }_3\ge 0,~\varvec{\lambda }_2-\varvec{\lambda }_3\le c_5\varvec{e}_1,~\varvec{e}_1^T\varvec{\lambda }_3\le c_6, \end{aligned}$$
(10)

where

$$\begin{aligned} \varvec{\tilde{H}}= & {} \left[ \begin{array}{c c c} \varvec{BB}^T+c_4\varvec{I}&{}-\varvec{BA}^T&{}\varvec{BA}^T\\ -\varvec{AB}^T&{}\varvec{AA}^T&{}-\varvec{AA}^T\\ \varvec{AB}^T&{}-\varvec{AA}^T&{}\varvec{AA}^T \end{array} \right] + \left[ \begin{array}{c c c} \varvec{e}_2\varvec{e}_2^T&{}-\varvec{e}_2\varvec{e}_1^T&{}\varvec{e}_2\varvec{e}_1^T\\ -\varvec{e}_1\varvec{e}_2^T&{}\varvec{e}_1\varvec{e}_1^T&{}-\varvec{e}_1\varvec{e}_1^T\\ \varvec{e}_1\varvec{e}_2^T&{}-\varvec{e}_1\varvec{e}_1^T&{}\varvec{e}_1\varvec{e}_1^T \end{array} \right] . \end{aligned}$$

Once \((\varvec{w}_1,b_1)\) and \((\varvec{w}_2,b_2)\) are obtained from (9) and (10) by means of the dual coordinate descent method [15], a new input \(\varvec{x}\in \mathbf {R}^n\) is assigned to class \(i~(i=+1,-1)\) by \({Class~i}=\arg \min \limits _{k=1,2} \frac{|\varvec{w}_k^T\varvec{x}+b_k|}{\Vert \varvec{w}_k\Vert }.\)
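A minimal sketch of this decision rule (illustrative names, not the paper's Matlab code):

```python
import numpy as np

def predict(X, w1, b1, w2, b2):
    """Assign each row of X to class +1 or -1 by the nearer of the two hyperplanes."""
    d1 = np.abs(X @ w1 + b1) / np.linalg.norm(w1)   # distance to f_1(x) = 0
    d2 = np.abs(X @ w2 + b2) / np.linalg.norm(w2)   # distance to f_2(x) = 0
    return np.where(d1 <= d2, 1, -1)
```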

3.2 Kernel PNSVM

Kernel PNSVM seeks the following two kernel-generated surfaces instead of hyperplanes: \(K(\varvec{x}^T, \varvec{C}^T)\varvec{w}_1+b_1 = 0\) and \(K(\varvec{x}^T, \varvec{C}^T)\varvec{w}_2+b_2 = 0,\) where \(\varvec{C}^T=[\varvec{A}~\varvec{B}]^T\in \mathbf {R}^{n\times \ell }\) and \(\ell =m_1+m_2\). The corresponding optimization problems are

$$\begin{aligned} \min _{\varvec{w}_1,b_1,\varvec{p},\varvec{q}_1,\rho _1}~~&\frac{1}{2}\varvec{p}^T\varvec{p}+\frac{1}{2}c_1(\Vert \varvec{w}_1\Vert ^2+b_1^2)+c_2\varvec{e}_2^T\varvec{q}_1+c_3\rho _1 \\ \text {s.t.}~~&K(\varvec{A},\varvec{C}^T)\varvec{w}_1+\varvec{e}_1b_1=\varvec{p},\\ &-(K(\varvec{B},\varvec{C}^T)\varvec{w}_1+\varvec{e}_2b_1)\ge \varvec{e}_2-\varvec{q}_1,\\ &-(K(\varvec{B},\varvec{C}^T)\varvec{w}_1+\varvec{e}_2b_1)\le \rho _1\varvec{e}_2-\varvec{q}_1,\\ &\varvec{q}_1\ge 0,~\rho _1\ge 1 \end{aligned}$$
(11)

and

$$\begin{aligned} \min _{\varvec{w}_2,b_2,\varvec{\tilde{p}},\varvec{q}_2,\rho _2}~~&\frac{1}{2}\varvec{\tilde{p}}^T\varvec{\tilde{p}}+\frac{1}{2}c_4(\Vert \varvec{w}_2\Vert ^2+b_2^2)+c_5\varvec{e}_1^T\varvec{q}_2+c_6\rho _2 \\ \text {s.t.}~~&K(\varvec{B},\varvec{C}^T)\varvec{w}_2+\varvec{e}_2b_2=\varvec{\tilde{p}},\\ &K(\varvec{A},\varvec{C}^T)\varvec{w}_2+\varvec{e}_1b_2\ge \varvec{e}_1-\varvec{q}_2,\\ &K(\varvec{A},\varvec{C}^T)\varvec{w}_2+\varvec{e}_1b_2\le \rho _2\varvec{e}_1-\varvec{q}_2,\\ &\varvec{q}_2\ge 0,~\rho _2\ge 1 \end{aligned}$$
(12)

where \(c_i, i=1,\cdots ,6\) are parameters.

Table 1. The comparison results of linear classifiers.

4 Experimental Results

In this section, UCI data sets are chosen to demonstrate the performance of our PNSVM compared with SVC, TWSVM and WLTSVM [13]. The methods are implemented in Matlab 9.0 on a PC with an Intel(R) Core i7 (2.70 GHz) processor and 32 GB RAM. Our PNSVM is solved by the dual coordinate descent algorithm, while SVC and TWSVM are solved by the QP solver of the Matlab optimization toolbox. The classification accuracy is measured by standard 10-fold cross-validation.
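For reference, the dual coordinate descent method of [15] updates one multiplier at a time; a minimal sketch for a generic box-constrained QP of the form \(\min _{\varvec{\pi }}\ \frac{1}{2}\varvec{\pi }^T\varvec{H}\varvec{\pi }-\varvec{\kappa }^T\varvec{\pi }\), \(\varvec{l}\le \varvec{\pi }\le \varvec{u}\), is given below. This is only an illustration of the solver family; the PNSVM duals (9) and (10) involve additional linear constraints that this sketch does not handle, and all names are illustrative.

```python
import numpy as np

def dcd_box_qp(H, kappa, lower, upper, n_sweeps=100):
    """Coordinate descent for min 0.5*pi^T H pi - kappa^T pi  s.t. lower <= pi <= upper."""
    pi = np.clip(np.zeros_like(kappa), lower, upper)
    grad = H @ pi - kappa                     # gradient of the objective at pi
    for _ in range(n_sweeps):
        for i in range(len(pi)):
            if H[i, i] <= 0.0:
                continue
            # Exact minimization along coordinate i, then projection onto the box.
            new_i = np.clip(pi[i] - grad[i] / H[i, i], lower[i], upper[i])
            delta = new_i - pi[i]
            if delta != 0.0:
                grad += delta * H[:, i]       # keep the gradient up to date
                pi[i] = new_i
    return pi
```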

Fig. 1. The influence on Accuracy (AC) of parameters \(c_3\) and \(c_6\) of linear PNSVM.

The classification accuracy, computation time, and optimal values of \(c_1(=c_2)\), \(c_4(=c_5)\), \(c_3\) and \(c_6\) in our PNSVM are listed in Table 1. The parameters of PNSVM, TWSVM, SVC and WLTSVM are searched in the range \(\{2^{2i}|i=-8,\cdots ,8\}\), and the parameters \(c_3\) and \(c_6\) of PNSVM are selected from the set \(\{0.01,0.1,1,10,100\}\). Table 1 shows the comparison results of all four methods. It can be seen from Table 1 that the accuracy of the proposed linear PNSVM is significantly better than that of the linear TWSVM on most of the data sets, and our PNSVM is the fastest on most of the data sets. Figure 1 exhibits the influence of the parameters \(c_3\) and \(c_6\) on the accuracy of linear PNSVM. It can be observed that the choice of \(c_3\) and \(c_6\) affects the results dramatically, which implies that tuning \(c_3\) and \(c_6\) is of practical importance in real applications. Table 2 is concerned with our kernel PNSVM, TWSVM, SVC and WLTSVM. The Gaussian kernel \(K(\varvec{x},\varvec{x}')=e^{-\mu \Vert \varvec{x}-\varvec{x}'\Vert ^2}\) is used. The kernel parameter \(\mu \) is searched from the set \(\{2^i|i=-8,\cdots ,8\}\). The classification accuracy and computation time of all four methods are also listed in Table 2.
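For completeness, the Gaussian kernel matrix defined above can be computed as in the following sketch (the function name and arguments are illustrative):

```python
import numpy as np

def gaussian_kernel(X, C, mu):
    """K[i, j] = exp(-mu * ||X[i] - C[j]||^2) for row-sample matrices X and C."""
    sq_dist = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(C ** 2, axis=1)[None, :]
        - 2.0 * X @ C.T
    )
    return np.exp(-mu * np.maximum(sq_dist, 0.0))
```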

Table 2. The comparison results of kernel classifiers.

5 Conclusions

For the binary classification problem, a novel large margin proximal non-parallel support vector machine (PNSVM) was proposed in this paper. The main contributions are that the structural risk minimization principle is implemented by introducing a regularization term in the primal problems of our PNSVM, and that the dual formulation of PNSVM avoids the computation of inverse matrices and speeds up training. Experimental comparisons of our PNSVM with SVC, TWSVM and WLTSVM on several data sets show that the proposed method is not only faster but also exhibits better generalization. It should be pointed out that there are six parameters in our PNSVM, so parameter selection, together with more efficient algorithms [17], is a practical problem that should be addressed in the future.