
1 Introduction

The importance of kernel methods such as Support Vector Machines (SVM) lies in the fact that they can approximate very complex non-linear decision functions thanks to the kernel trick [5]. Using kernels as a similarity measure allows the user to incorporate domain knowledge that helps to shape the geometry of the data representation space. The use of a Reproducing Kernel Hilbert Space offers many advantages in machine learning, such as the possibility of defining powerful and flexible models and of generalizing many results and algorithms for linear models in Euclidean spaces [5]. However, traditional kernel methods suffer from several problems, especially memory and time complexity, which grow at least quadratically with the number of samples in the training dataset [2]. Thus, kernel methods are very successful with small datasets but do not scale well on their own to large datasets.

Given that the size of data has been growing exponentially, machine learning methods increasingly rely on more efficient optimization strategies. In this sense, Stochastic Gradient Descent (SGD) has emerged as an effective procedure for large-scale learning [3]. The classic formulation of the optimization problem in kernel-based methods does not permit an explicit implementation of SGD. However, it turns out to be possible in the Least Squares Support Vector Machine (LS-SVM) [11]. Besides the SGD implementation, different approximation techniques can be used to relieve the computational cost of the Gram matrix. In this paper we use the Learning on a Budget strategy, which consists in taking only a reduced number of representative instances to compute the kernel matrix: the machine is trained with a budget kernel matrix. This strategy has already been applied to the automatic multi-label annotation problem [8], showing a significant reduction in computational complexity with no loss of accuracy.

In the present work we evaluate and compare the performance of LS-SVM using Learning on a Budget against LS-SVM using the Nyström approximation [4] instead of the budget. In addition, an Online Random Fourier Features LS-SVM was proposed and implemented to compare the results with a state-of-the-art method. The Random Fourier Features (RFF) method [9] gives an explicit feature mapping into a low-dimensional feature space \(\hat{\mathcal {F}}\). This makes it possible to solve the LS-SVM primal optimization problem directly in \(\hat{\mathcal {F}}\). As the primal problem also has an explicit summation over the error of each prediction, an SGD implementation is feasible. The rest of the paper is organized as follows. Section 2 contains theoretical background about the methods. Section 3 describes the proposed method. Results and discussion of the experimental work are presented in Sect. 4. Finally, Sect. 5 offers some concluding remarks.

2 Related Work

Approximated kernel methods have been widely studied due to their computational benefits [13]. One of the most used is the Nyström method [4, 10], which finds a low-rank approximation of the kernel matrix from a matrix decomposition. It takes \(\beta \ll n\) instances and constructs the sub-matrix of X corresponding to those instances, B; this matrix is called the budget. It then defines \(C = k(X, B)\) and \(W = k(B, B)\), and approximates \(\varOmega \) by \(\tilde{\varOmega } = CW^{-1}C^{T}\). This approximation saves the memory needed to store the kernel matrix and the time spent computing the overall loss in a training step. The Nyström method can be extended to find a rank-k approximation of \(\varOmega \): the best rank-k approximation of W is used instead of the original W. Since it relies on an approximation of the full kernel matrix, this strategy is not suited to an online implementation.
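For reference, a minimal NumPy sketch of this construction is shown below; the RBF kernel, the random choice of the budget, and the use of a pseudo-inverse in place of \(W^{-1}\) are illustrative choices of this sketch, not details prescribed by [4, 10].

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def nystrom_approximation(X, beta, gamma=1.0, seed=0):
    """Approximate Omega = k(X, X) from beta randomly chosen instances (the budget)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=beta, replace=False)
    B = X[idx]                          # the budget
    C = rbf_kernel(X, B, gamma)         # n x beta
    W = rbf_kernel(B, B, gamma)         # beta x beta
    # Pseudo-inverse instead of W^{-1} for numerical stability when W is ill-conditioned.
    return C @ np.linalg.pinv(W) @ C.T
```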

Another broadly used approximated kernel method is the RFF method [9, 13], which establishes a relation between a shift-invariant kernel k and a probability distribution p using Bochner's theorem. This makes it possible to approximate the feature map \(\phi \) with linear projections onto D random features and gives a low-dimensional representation \(\hat{\mathcal {F}}\) of the feature space \(\mathcal {F}\) induced by the kernel. Having an explicit representation of the approximated feature map \(\hat{\phi }\) allows us to use the images of the training data as input to a simple linear learning algorithm, which requires neither large memory capacity nor high computational complexity.
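As an illustration, a compact sketch of the RFF mapping for the RBF kernel \(k(x,y)=\exp (-\gamma \Vert x-y\Vert ^2)\) could look as follows; the cosine-only variant and the Gaussian sampling of the projection matrix (the spectral density given by Bochner's theorem for this kernel) are assumptions of this sketch rather than details taken from [9, 13].

```python
import numpy as np

def rff_map(X, D, gamma=1.0, seed=0):
    """Random Fourier features z(X) such that <z(x), z(y)> approximates the RBF kernel."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # For k(x, y) = exp(-gamma ||x - y||^2), Bochner's theorem gives a Gaussian
    # spectral density with standard deviation sqrt(2 * gamma).
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```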

Recently, several works have used a different approach, known as Learning on a Budget [8, 12]. In this method, the loss function of the SVM does not use the full kernel matrix, but only a small portion of it. The formulation of the SVM using Learning on a Budget makes it possible to use SGD and thus online learning. Different versions of the Learning on a Budget strategy have been studied in recent years, as in [6], in which the overall formulation of an LS-SVM was adapted to use a budget, thereby improving memory usage and time complexity. However, all the previous methods based on the LS-SVM keep working with systems of linear equations, which require successive replacements of the entries of the matrix in the case of an online implementation.

3 Method

The classic LS-SVM solves the optimization problem by means of a system of linear equations. Here we describe an alternative that solves the convex dual problem by means of SGD, using just a portion of the training data.

3.1 Least Squares Support Vector Machine

LS-SVM is a least-squares version of the SVM for classification or regression problems. The problem considers equality constraints instead of the inequalities of the classic SVM, which allows the solution to be obtained by solving a system of linear equations. Given a set of training data \(\{x_1,\ldots ,x_n\}\subset X\) and labels \(\{y_i\}_{i=1}^n\), and given a nonlinear feature mapping \(\phi :X\rightarrow \mathcal {F}\) associated to the kernel function k, the LS-SVM classifier defines the classification problem as [11]

$$\begin{aligned} \min _{w,b,e}{J(w,b,e)}=\frac{1}{2}w^Tw+\gamma \frac{1}{2}\sum _{k=1}^{n}e_k^2, \end{aligned}$$
(1)

subject to

$$y_k\left[ w^T\phi \left( x_k\right) +b\right] =1-e_k,\ \ \ k=1,\ldots ,n.$$

Once the Lagrangian is defined and the Karush-Kuhn-Tucker conditions are imposed, the dual problem arises as the system of equations

$$\begin{aligned} \begin{bmatrix} \varOmega +I_n/\gamma&1_n\\ 1_n^T&0 \end{bmatrix} \begin{bmatrix} \alpha \\ b \end{bmatrix} = \begin{bmatrix} y\\ 0 \end{bmatrix} , \end{aligned}$$
(2)

where \(\varOmega _{ij}=k\left( x_i,x_j\right) =\langle \phi (x_i),\phi (x_j) \rangle \) is the kernel matrix, \(1_n=\left[ 1,\ldots ,1\right] ^T \in \mathbb {R}^n\), \(\alpha =\left[ \alpha _1,\ldots ,\alpha _n \right] ^T\) is the vector of Lagrange multipliers, \(y=\left[ y_1,\ldots ,y_n\right] ^T\), and \(I_n\) is the \(n\times n\) identity matrix. Once the system is solved for \(\alpha \) and b, the model is given by:

$$\begin{aligned} y\left( x\right) =w^T\phi \left( x\right) +b, \end{aligned}$$
(3)

where \(w=\sum _{i=1}^{n}{\alpha _iy_i\phi (x_i)}\). The first attempts to apply LS-SVM to large datasets required solving the linear system by means of an iterative method such as Conjugate Gradient or Successive Over-Relaxation [11].
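To fix notation, a small NumPy sketch of the direct solution of system (2) and of the model (3) is given below; the dense `np.linalg.solve` call is for illustration only and clearly does not scale to the large-n regime that motivates this work.

```python
import numpy as np

def lssvm_fit(K, y, gamma=1.0):
    """Solve the dual system (2) for alpha and b, given the kernel matrix K = k(X, X)."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = K + np.eye(n) / gamma     # Omega + I_n / gamma
    A[:n, n] = 1.0                        # 1_n column
    A[n, :n] = 1.0                        # 1_n^T row
    rhs = np.concatenate([y, [0.0]])
    sol = np.linalg.solve(A, rhs)
    return sol[:n], sol[n]                # alpha, b

def lssvm_decision(K_test_train, alpha, y_train, b):
    """Decision values following Eq. (3) with w = sum_i alpha_i y_i phi(x_i)."""
    return K_test_train @ (alpha * y_train) + b
```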

For the dual version, we take the Lagrangian of the original LS-SVM problem (1)

$$\begin{aligned} \mathcal {L}\left( w,b,e,\alpha \right) =J\left( w,b,e\right) -\sum _{k=1}^{n}{\alpha _k\left( y_k\left[ w^T\phi \left( x_k\right) +b\right] -1+e_k\right) }, \end{aligned}$$
(4)

subject to the optimality conditions \(w=\sum _{k=1}^{n}{\alpha _ky_k\phi \left( x_k\right) }\), \(\sum _{k=1}^{n}{\alpha _ky_k=0}\), \(\alpha _k=\gamma e_k\), and \(y_k\left[ w^T\phi \left( x_k\right) +b\right] -1+e_k=0\) for \(k=1,\ldots ,n\). Plugging these into Eq. (4), we get the dual problem

$$\begin{aligned} \mathcal {L}\left( w,b,e,\alpha \right) =-\frac{1}{2}\left( \alpha y\right) ^Tk\left( X,X\right) \left( \alpha y\right) +\sum _{k=1}^{n}\alpha _k-\frac{\gamma }{2}\sum _{k=1}^{n}\left( 1-y_k\left[ {(\alpha y)}^Tk\left( X,x_k\right) +b\right] \right) ^2, \end{aligned}$$
(5)

where \((\alpha y)\) denotes the elementwise product of \(\alpha \) and y, and the expression must be maximized with respect to \(\alpha _k\), \(k=1,\ldots ,n\), and b.

3.2 Large Scale LS-SVM

Solving a system of linear equations is an overly complicated procedure when an online implementation is required. Solving a quadratic optimization problem by means of SGD, in contrast, is a widely used strategy, for example in the training of deep network architectures [3].

Budget LS-SVM. The Learning on a Budget strategy can be implemented in LS-SVM as follows: instead of computing the entire kernel matrix, a random selection of \(\beta \ll n\) instances is made, forming a sub-matrix B of the input data matrix X with which the machine is trained. The loss function becomes

$$\begin{aligned} \min _{\alpha ,b}{\mathcal {L}'}=\frac{1}{2}\left( \alpha y\right) ^Tk\left( B,B\right) \left( \alpha y\right) -\sum _{k=1}^{\beta }\alpha _k+\frac{\gamma }{2}\sum _{k=1}^{n}\left( 1-y_k\left[ {(\alpha y)}^{T}k\left( B,x_k\right) +b\right] \right) ^2. \end{aligned}$$
(6)
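For concreteness, the loss (6) can be written in a few lines of NumPy as sketched below; the names `K_BB`, `K_BX`, and `y_B` are introduced here for illustration, and interpreting the y in \((\alpha y)\) as the labels of the budget instances is an assumption of this sketch.

```python
import numpy as np

def budget_lssvm_loss(alpha, b, K_BB, K_BX, y_B, y, gam=1.0):
    """Loss (6): alpha has beta entries, K_BB = k(B, B), K_BX = k(B, X)."""
    ay = alpha * y_B                           # elementwise product (alpha y)
    reg = 0.5 * ay @ K_BB @ ay - alpha.sum()
    errors = 1.0 - y * (ay @ K_BX + b)         # one residual per training sample
    return reg + 0.5 * gam * errors @ errors
```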

Online Budget LS-SVM. SGD permits an online implementation, since it updates the solution using a single training sample at a time, which further alleviates the memory requirements. Following this, given the derivatives

$$\begin{aligned} \frac{\partial \mathcal {L}'}{\partial \alpha _i}=\sum _{k=1}^{\beta }{\alpha _ky_ky_ik\left( x_i,x_k\right) }-1-\gamma \sum _{k=1}^{n}{\left( 1-y_k\left[ {(\alpha y)}^Tk\left( B,x_k\right) +b\right] \right) y_ky_ik\left( x_i,x_k\right) }, \end{aligned}$$
(7)

the update rule is given by

$$\begin{aligned} \begin{aligned} \alpha _m=\alpha _m-&\eta y_m(\alpha y)^T k\left( B,x_m\right) +\eta \\&+\eta \gamma n\left( 1-y_j\left[ (\alpha y)^Tk\left( B,x_j\right) +b\right] \right) y_j y_m k\left( x_j,x_m\right) , \end{aligned} \end{aligned}$$
(8)

where \((x_j,y_j)\) is a randomly chosen instance of X. The entire procedure of the Online Budget LS-SVM is described in Algorithm 1.

[Algorithm 1: Online Budget LS-SVM]
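Since Algorithm 1 is not reproduced here, the following NumPy sketch gives one plausible reading of it, applying the single-sample update of Eq. (8) for a fixed number of passes; the learning rate, the number of passes, and the gradient step for b (which Eq. (8) does not show) are assumptions of this sketch.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):                 # same helper as in the Nystrom sketch
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def online_budget_lssvm(X, y, beta, n_passes=10, eta=1e-3, gam=1.0, seed=0):
    """SGD on loss (6) via the update rule (8), one randomly chosen sample at a time."""
    rng = np.random.default_rng(seed)
    n = len(y)
    idx = rng.choice(n, size=beta, replace=False)
    B, y_B = X[idx], y[idx]                      # the budget and its labels
    K_BB = rbf_kernel(B, B)                      # beta x beta block, reused at every step
    alpha, b = np.zeros(beta), 0.0
    for _ in range(n_passes * n):
        j = rng.integers(n)                      # randomly chosen instance (x_j, y_j)
        k_Bj = rbf_kernel(B, X[j:j + 1]).ravel() # k(B, x_j)
        ay = alpha * y_B
        err = 1.0 - y[j] * (ay @ k_Bj + b)
        # Vectorized form of Eq. (8): K_BB[:, m] plays the role of k(B, x_m).
        alpha = alpha - eta * y_B * (K_BB @ ay) + eta \
                + eta * gam * n * err * y[j] * y_B * k_Bj
        b = b + eta * gam * n * err * y[j]       # SGD step for b (assumed; not in Eq. (8))
    return alpha, b, B, y_B
```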

Nyström LS-SVM. As with the budget strategy, the Nyström method can be used in (5) to approximate the kernel matrix. The loss function becomes

$$\begin{aligned} \min _{\alpha ,b}{\mathcal {L}'}=\frac{1}{2}\left( \alpha y\right) ^T\hat{k}\left( X,X\right) \left( \alpha y\right) -\sum _{k=1}^{n}\alpha _k+\frac{\gamma }{2}\sum _{k=1}^{n}\left( 1-y_k\left[ {(\alpha y)}^T\hat{k}\left( X,x_k\right) +b\right] \right) ^2, \end{aligned}$$
(9)

where \(\hat{k}\) is the inner-product function obtained after the Nyström approximation of the original kernel k.

4 Experimental Evaluation

4.1 Experimental Setup

The proposed methods were implemented in TensorFlow [1], a dataflow framework with GPU support and one of the most widely used tools for developing and researching machine learning algorithms. All the datasets were partitioned into 80% for training and 20% for testing. The optimization was performed by SGD with the Adam optimizer [7], running for 1000 epochs. Four binary classification problems were chosen to test the proposed models using an RBF kernel. The datasets are described in Table 1. The Online Budget LS-SVM was trained with different budget proportions: 0.2, 0.4, 0.6, 0.8, and 1.0 of the original data size. The same proportions were used to build the Nyström low-rank matrix and train the Nyström LS-SVM. In order to compare results with a state-of-the-art method, an Online RFF LS-SVM (which solves (1) in the primal by means of an approximation of the feature mapping \(\phi \) and a linear learning algorithm) was tested for five different feature sizes, matching the budget sizes for each dataset. After a parameter exploration for \(\gamma \), we decided to fix \(\gamma = 1.0\).
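As an illustration of this setup, a condensed TensorFlow 2 training loop for loss (6) with precomputed kernel blocks might look as follows; the function name, the learning rate, and the precomputation of `K_BB` and `K_BX` are assumptions of this sketch rather than details reported above.

```python
import tensorflow as tf

def train_budget_lssvm_tf(K_BB, K_BX, y_B, y, gam=1.0, epochs=1000, lr=1e-3):
    """Minimize the budget loss (6) with the Adam optimizer, as in the experimental setup."""
    K_BB = tf.convert_to_tensor(K_BB, tf.float64)
    K_BX = tf.convert_to_tensor(K_BX, tf.float64)   # beta x n block k(B, X)
    y_B = tf.convert_to_tensor(y_B, tf.float64)
    y = tf.convert_to_tensor(y, tf.float64)
    alpha = tf.Variable(tf.zeros(K_BB.shape[0], dtype=tf.float64))
    b = tf.Variable(0.0, dtype=tf.float64)
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    for _ in range(epochs):
        with tf.GradientTape() as tape:
            ay = alpha * y_B
            reg = 0.5 * tf.tensordot(ay, tf.linalg.matvec(K_BB, ay), 1) - tf.reduce_sum(alpha)
            err = 1.0 - y * (tf.linalg.matvec(K_BX, ay, transpose_a=True) + b)
            loss = reg + 0.5 * gam * tf.reduce_sum(err ** 2)
        grads = tape.gradient(loss, [alpha, b])
        opt.apply_gradients(zip(grads, [alpha, b]))
    return alpha.numpy(), b.numpy()
```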

Table 1. Dataset details. In the case of Wine and Mnist, only two classes were taken.
Table 2. Mean training times for each dataset. Times reported for the Online Budget LS-SVM correspond to the experiments trained with 20% of the data. For the Nyström LS-SVM, they correspond to the experiments with an approximated kernel matrix built from 20% of the data, and for the RFF LS-SVM, the number of selected features equals 20% of the training set size.

4.2 Results

Each configuration was executed several times, and the mean and standard deviation of the results are presented in Fig. 1. On Wine and Spambase, the Online Budget LS-SVM outperformed the Nyström LS-SVM, especially for the larger budget proportions. On Mnist and Bank, although Nyström reached higher accuracy levels, the difference with respect to the budget version is small. The RFF approach showed the worst performance on all datasets, especially on Mnist, where the accuracy remained almost constant at 0.5, indicating no capacity to learn. As expected, on all datasets the standard deviation decreases as the dataset grows in size, and there is more stability in the Nyström procedure since it recreates the entire kernel matrix. Regarding the training times (Table 2), it is notable that on Wine the Nyström LS-SVM is faster than the Online Budget LS-SVM. However, as the dataset gets bigger, the Nyström LS-SVM becomes slower than the budget version (as on Bank).

Fig. 1. Mean and standard deviation of the results reached by the Online Budget LS-SVM, the Nyström LS-SVM, and the Online RFF LS-SVM.

4.3 Discussion

The results show similar performance between the Online Budget LS-SVM and the Nyström LS-SVM; neither method significantly dominates the other. However, solving the LS-SVM problem in the primal using the RFF approach did not show good performance, regardless of the number of features selected. The standard deviation reported with the budget strategy is higher than that reported with the Nyström approximation, which indicates that the Nyström approximation of the kernel matrix is more stable. The running times show that, compared to the Nyström method, the budget method works best for larger sets; this is where the computational complexity of the Nyström approximation becomes evident.

5 Conclusions

In this work we presented the Online Budget LS-SVM, a large-scale learning method based on the LS-SVM algorithm. It uses the Learning on a Budget technique to avoid computing the entire kernel matrix. In order to compare its performance with other state-of-the-art approximation methods, a Nyström approximation (Nyström LS-SVM) and an RFF approach were also implemented. Experimental results show that there is no significant loss of accuracy when a random budget is selected to train the machine. Comparing the Online Budget LS-SVM with the Nyström LS-SVM and the Online RFF LS-SVM, the Online Budget LS-SVM is on par with the Nyström version, sometimes even outperforming it. The execution times showed that, on large datasets, the computation required to obtain the Nyström low-rank matrix approximation is not compensated by any improvement in the performance of the method, as shown by the running times for the Bank dataset. Regarding the Online RFF LS-SVM, the results showed poor performance compared to the other methods, independently of the number of features. To conclude, the Learning on a Budget technique alleviates the computation of the kernel matrix without significant loss of accuracy, speeding up the training process and making kernel-based methods more scalable.