
1 Introduction

The kernel Support Vector Machine (SVM) is a well-established supervised machine learning technique. Its training computational complexity depends on the effective number of support vectors. In online settings, the number of support vectors grows linearly with the size of the dataset, which makes online kernel SVMs unappealing for large-scale problems.

Concerning the online capabilities of the SVM, several solvers have been proposed based on the primal formulation of [3], optimized using variations of Stochastic Subgradient Methods (SSM), including Stochastic Gradient Descent (SGD). One of the most widely used is Stochastic Gradient SVM [2]. Pegasos [8] is another well-known solver, which stands out due to its use of integer operations that speed up the training process.

The reduction of the number of support vectors has been an important topic in the machine learning community. Most proposals are based on iteratively adding one support vector at a time in a greedy fashion. In SpSVM [6], the newly included support vector is selected from a set of candidates, optimizing its coefficient alone while freezing the other weights. The resulting method is computationally expensive and lacks online capability. SVMperf [4] optimizes the dual formulation using cutting planes, while each support vector is chosen by solving a preimage problem. This process is also costly and thus unsuitable for online learning. The work of [1], LaSVM, extends the original SMO solver [7] with online capabilities. Its support vector control involves two processes: the first adds one support vector at a time and computes its coefficient using the traditional SMO update step; the second finds the most violating pair of support vectors and recomputes their coefficients. If one or both coefficients vanish, the associated support vectors are removed. Lastly, several works control sparsity in the solution by adding an \(L_1\) norm regularizer to the SVM formulation [10], an idea originally applied in the Relevance Vector Machine [9]. Other approaches can be found in [5].

However, controlling the exact number of support vectors can be important in many different scenarios. Consider, for example, a large-scale problem: by limiting the number of support vectors we can control the amount of memory used by the kernel, which also speeds up the training process. This not only makes the method suitable for ingesting large amounts of data, but also usable in environments with low computational resources. Another scenario where controlling the number of support vectors matters is the exploration phase of a machine learning problem, where a fast lower bound on the validation performance can be computed to help in the hyperparameter optimization or model selection processes. One way to address this is to set an upper bound on the number of support vectors, called a budget, and find an approximate solution. However, the mechanisms to maintain this budget are not trivial and sometimes lead to a large loss of performance.

Our proposal, Elastic Budget SVM (EBSVM), solves the primal kernel SVM formulation [3] with a set of additional constraints that enforce an upper bound on the number of support vectors, using stochastic subgradient methods. Concretely, the magnitude of the coefficient associated with each support vector is required to be above a threshold. This threshold is dynamically optimized during the training process so as to minimize the difference between the current number of support vectors and the required budget. The method requires no additional parameters beyond the allowed budget.

The remainder of this article is organized as follows. Section 2 describes the core of the EBSVM algorithm in detail and provides pseudo-code. We validate our proposal with a set of experiments on the UCI database in Sect. 3. Finally, Sect. 4 offers the conclusions.

2 Proposal

Our proposal is founded on the following intuition. We consider the notion of coverage, which refers to the effective influence of a support vector on its neighbors. The larger the influence area of a support vector, the smaller the number of support vectors required for modeling/covering all the data. In quasi-concave kernel functions of the form \(k(x_{sv},x)=e^{-\gamma d(x_{sv},x)}\), where \(d(\cdot )\) is an arbitrary distance metric, the coverage is controlled by the parameter \(\gamma \). In the most well-known formulation with radial basis functions, \(d(\cdot )\) is the squared Euclidean distance and \(\gamma \) is the inverse of the desired variance, \(1/\sigma ^2\). However, the same effect can be achieved by controlling the magnitude of the solution coefficients \(\alpha \).

The relationship between coverage and coefficients is easily visualized through the level sets of the kernel function. Consider the level set \(k(x_{sv},x)=e^{-\gamma d(x_{sv},x)}=\epsilon \). The \(\epsilon \)-coverage can be written as the hyper-volume enclosed by this level set, i.e. the volume of the region \(g = \{x \mid k(x_{sv},x)\ge \epsilon \}\), \( Vol = \int _{\mathcal {R}} \mathbb {1}\left[ x \in g \right] dx\).

If the kernel is a quasi-concave function, the volume increases for small values of \(\epsilon \) and decreases for large values of \(\epsilon \). Now consider a weighted kernel \(\alpha k(x_{sv},x)\) with \(\alpha >0\). The covered region becomes \(\hat{g} = \{x \mid \alpha k(x_{sv},x)\ge \epsilon \}\), which is equivalent to \(\hat{g} = \{x \mid k(x_{sv},x)\ge \epsilon /\alpha \}\). It is then easy to see that the larger the value of \(\alpha \), the smaller the effective threshold \(\epsilon /\alpha \) and, consequently, the larger the coverage of the function. We therefore use the magnitude of the coefficients as a heuristic to rank the importance of the support vectors.
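As a concrete illustration with the radial basis function case \(d(x_{sv},x)=\Vert x_{sv}-x\Vert ^2\), solving \(\alpha e^{-\gamma d}=\epsilon \) for \(d\) gives the covered squared radius \(d=\ln (\alpha /\epsilon )/\gamma \), which grows with \(\alpha \). The short sketch below, with purely illustrative values, makes this explicit.

```python
import numpy as np

def coverage_radius(alpha, eps=0.1, gamma=1.0):
    """Squared-distance radius at which alpha * exp(-gamma * d) drops to eps.

    Solving alpha * exp(-gamma * d) = eps gives d = ln(alpha / eps) / gamma,
    so a larger coefficient magnitude implies a larger covered region.
    """
    return np.log(alpha / eps) / gamma

for a in (0.2, 1.0, 5.0):
    print(f"alpha = {a:4.1f}  ->  covered squared radius = {coverage_radius(a):.2f}")
```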

2.1 Sparsity Constraints

We propose to model the problem by adding an additional constraint that depends on a parameter \(\theta \), a lower bound on the magnitude of the coefficients \(\alpha \), as shown in Eq. 1.

$$\begin{aligned} \begin{aligned} \underset{\alpha }{\text {minimize}}&\quad \frac{1}{2} \lambda \Vert \alpha \Vert ^2+ \sum _i \xi _i \\ \text {subject to}&\quad y_i( \sum _j \alpha _j k(x_i, x_{sv_j})) \ge 1 - \xi _i, \; i = 1, \ldots , m \\&\quad \alpha _i = \left\{ \begin{aligned} \alpha _i&\quad \text {if} \quad |\alpha _i|\ge \theta \\ 0&\quad \text {otherwise}\\ \end{aligned}\right. \end{aligned} \end{aligned}$$
(1)

The value of \(\theta \) allows us to control the number of support vectors. However, this formulation does not by itself solve the problem of finding a solution given a budget \(B\) of support vectors, i.e. the required number of support vectors. For this task, we define a new function \(N = f(\theta )\) that relates the value of \(\theta \) to the number of support vectors, \(N\), and require the optimization problem to simultaneously optimize the former objective together with a new cost term \((B-f(\theta ))^2\) that penalizes the deviation with respect to the desired budget. The problem can be formulated as follows,

$$\begin{aligned} \begin{aligned} \underset{\alpha , \theta }{\text {minimize}}&\quad \frac{1}{2} \lambda \Vert \alpha \Vert ^2+ \sum _i \xi _i + (B-f(\theta ))^2\\ \text {subject to}&\quad y_i( \sum _j \alpha _j k(x_i, x_{sv_j})) \ge 1 - \xi _i, \; i = 1, \ldots , m \\&\quad \alpha _i = \left\{ \begin{aligned} \alpha _i&\quad \text {if} \quad |\alpha _i|\ge \theta \\ 0&\quad \text {otherwise}\\ \end{aligned}\right. \\&\quad \theta >0 \end{aligned} \end{aligned}$$

Observe that the optimization of this problem requires solving for both \(\theta \) and \(\alpha \). However, this entails modeling \(f(\theta )\), i.e. the dependency of the number of support vectors on \(\theta \). We define \(f\) directly as in Eq. 2,

$$\begin{aligned} f(\theta ) = \Vert \alpha \Vert _0\Big |_\theta . \end{aligned}$$
(2)

This involves computing the number of nonzero coefficients in the solution.
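A minimal sketch, assuming NumPy arrays, of this count and of the projection implied by the constraint in Eq. 1 (used by the solver below); the helper names are ours, not from the original implementation.

```python
import numpy as np

def project_alpha(alpha, theta):
    # Constraint of Eq. 1: coefficients whose magnitude falls below theta are set to zero.
    return np.where(np.abs(alpha) >= theta, alpha, 0.0)

def f_theta(alpha, theta):
    # Eq. 2: f(theta) = ||alpha||_0 evaluated after thresholding at theta.
    return int(np.count_nonzero(np.abs(alpha) >= theta))
```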

In order to solve this problem computationally using stochastic subgradient methods, we use a block coordinate descent approach. At each iteration we first solve for \(\alpha \) and project the solution onto the feasible set, i.e. we drop the support vectors whose coefficient magnitude is below \(\theta \). The second step of the iteration is the optimization with respect to \(\theta \). The gradient of the loss function with respect to \(\theta \) is

$$ \frac{\partial \mathcal {L}}{\partial \theta } = \frac{\partial \mathcal {L}}{\partial f(\theta )}\frac{\partial f(\theta )}{\partial \theta } = -(B-f(\theta )) \frac{\partial f(\theta )}{\partial \theta } $$

The second part of the gradient requires computing a subgradient of \(f(\theta )\). The exact modeling of this term is complex. For this reason we introduce a surrogate of the exact function that follows the expected behavior of the number of support vectors with respect to the parameter. In particular, we approximate \(f(\theta )\) with a linear model, \(f(\theta ) = b(a-\theta )\). Observe that this captures the notion that the number of support vectors should decrease when \(\theta \) increases. Thus, the final gradient computation becomes

$$\frac{\partial \mathcal {L}}{\partial \theta } \approx (B-f(\theta )) b$$

The update with respect to \(\theta \) results in the formulation of Eq. 3,

$$\begin{aligned} \theta _{t+1} = \theta _t - \eta ' (B - f(\theta )) \end{aligned}$$
(3)

where \(\eta '\) subsumes the constant term \(b\). Algorithm 1 provides the pseudo-code of our proposal. It starts with an empty set of support vectors. Iteratively, a point is sampled from the dataset uniformly at random. The gradient is then computed and, if it is non-zero, the point is added to the support vector set. The other coefficients are updated using \(\eta = \frac{1}{\lambda t}\). Then \(\theta \) is updated according to Eq. 3. Finally, the support vectors whose coefficient magnitude is below \(\theta \) are removed. However, we found experimentally that, since we sample one data point at a time, removing more than one support vector per iteration is too harsh. Therefore we remove only the violating support vector with the smallest coefficient magnitude.

Algorithm 1. Pseudo-code of the EBSVM training procedure.
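As a complement to the pseudo-code, the following Python sketch reconstructs the training loop from the description above. It is our own hedged re-implementation: the Pegasos-style coefficient shrinkage, the hyperparameter names, and their default values are assumptions for illustration, not the authors' reference code.

```python
import numpy as np

def rbf_kernel(x, SV, gamma):
    """RBF kernel values between a single point x and each stored support vector."""
    if len(SV) == 0:
        return np.empty(0)
    d2 = np.sum((np.asarray(SV) - x) ** 2, axis=1)
    return np.exp(-gamma * d2)

def ebsvm_train(X, y, budget, lam=1e-3, gamma=1.0, eta_theta=1e-2, n_iter=10_000, seed=0):
    """Sketch of the EBSVM loop described in Algorithm 1 (illustrative defaults)."""
    rng = np.random.default_rng(seed)
    sv, alpha = [], []            # current support vectors and their coefficients
    theta = 0.0                   # coefficient-magnitude threshold
    for t in range(1, n_iter + 1):
        i = rng.integers(len(X))
        eta = 1.0 / (lam * t)     # learning rate eta = 1 / (lambda * t)
        k = rbf_kernel(X[i], sv, gamma)
        margin = y[i] * float(np.dot(alpha, k)) if sv else 0.0
        # Regularization step on the existing coefficients (Pegasos-style shrinkage).
        alpha = [(1.0 - eta * lam) * a for a in alpha]
        if margin < 1.0:          # non-zero hinge subgradient: add the sample as a support vector
            sv.append(np.asarray(X[i], dtype=float))
            alpha.append(eta * y[i])
        # Eq. 3: move theta towards the value that matches the requested budget B.
        theta = max(0.0, theta - eta_theta * (budget - len(sv)))
        # Remove at most one violating support vector: the one with the smallest |alpha|.
        if sv:
            j = int(np.argmin(np.abs(alpha)))
            if abs(alpha[j]) < theta:
                sv.pop(j)
                alpha.pop(j)
    return np.array(sv), np.array(alpha), theta
```

A call such as `ebsvm_train(X, y, budget=10)` would then keep the support vector set close to the requested budget once \(\theta \) stabilizes.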

3 Experiments

In this section, we present a set of experiments in order to assess the effectiveness and performance of our proposal. We introduce two different experiments comparing our proposal with a set of state-of-the-art SVM solvers. In the first experiment, we set \(B\) equal to the number of support vectors of the model showing the best accuracy on each individual problem. With this experiment we show that our approach effectively controls the budget without harming generalization. The second experiment sets \(B=10\) and shows the behavior of the method when the number of support vectors is very small.

3.1 Experimental Setup

Models. The models chosen are the online primal SVM from [3] with \(L_1\) norm, Pegasos and LaSVM. Regarding LaSVM, we have tuned its sparsity parameter \(\tau \) to achieve the best accuracy.

Performance Measures. We evaluate the classifiers both using the accuracy score to appraise their generalization capabilities, and the number of support vectors for assessing the complexity of the solution.

Datasets. The experiment consists of a comparison conducted over a set of datasets selected from the UCI repository, downloaded from MLData. All the datasets are standardized to \(\mu = 0\) and \(\sigma = 1\).

Generalization Error Estimation. In order to obtain reliable out-of-sample error estimates we use a nested 3-2 fold cross-validation, where the 2 inner folds are used to optimize the hyperparameters and the 3 outer folds to estimate accuracy. The number of support vectors is likewise averaged over the outer folds.

The hyperparameters are optimized over a logarithmic grid of 13 values ranging from \(10^{-3}\) to \(10^{9}\) for both the regularization parameter \(\lambda \) and the kernel bandwidth \(\sigma \), and over \(\{0.01, 0.001, 0.0001\}\) for the \(\tau \) parameter of LaSVM, as spelled out in the sketch below.
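A sketch of these search grids, with variable names of our own choosing:

```python
import numpy as np

lambda_grid = np.logspace(-3, 9, num=13)   # regularization parameter lambda: 10^-3 ... 10^9
sigma_grid = np.logspace(-3, 9, num=13)    # RBF bandwidth sigma: 10^-3 ... 10^9
tau_grid = [0.01, 0.001, 0.0001]           # sparsity parameter tau of LaSVM
```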

3.2 Discussion

The results are shown in Table 1. Note how our method indeed bounds the number of support vectors to that of the model exhibiting the best accuracy. This shows that our proposal effectively respects the budget, validating it as a budget SVM implementation.

Table 1. Accuracy and number of support vectors on different datasets.

A closer look at the results shows that the difference in accuracy between the proposed model and the best-performing one is usually very small. In six out of the twenty datasets compared, the proposal achieves the best score. There is a noticeable performance loss in just four of the datasets compared. Table 2 shows the average rank of every model. A model is assigned a rank ranging from 1 for the best to 4 for the worst on each dataset, using the average rank in case of ties. The ranks are then averaged in order to offer insight into the relative performance of the models, as in the sketch below. The resulting ranks show that our method ranks between the second and the third best on average, notably better than OnlineSVM L1 and not far from Pegasos, thus providing evidence that the loss of performance due to the use of a budget is not critical.
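The average rank statistic can be computed as in the following sketch; the accuracy matrix shown is purely illustrative and does not reproduce the values of Table 1.

```python
import numpy as np
from scipy.stats import rankdata

# Rows are datasets, columns are models; the numbers here are made up for illustration.
accuracy = np.array([
    [0.91, 0.93, 0.92, 0.93],
    [0.85, 0.84, 0.86, 0.83],
    [0.78, 0.80, 0.80, 0.79],
])

# Rank 1 is the best model on a dataset; rankdata assigns the mean rank on ties by default.
ranks = np.vstack([rankdata(-row) for row in accuracy])
print("average rank per model:", ranks.mean(axis=0))
```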

Furthermore, if we constrain our algorithm by setting the budget to \(B = 10\), the last column of Table 1 shows that our method performs surprisingly well considering the constraint. In some cases, such as breast-cancer, it even outperforms all the other methods. Observe that in at least nine of the datasets the result achieved is very close to the best performer, while in the remaining cases the results constitute a tight lower bound on the performance achieved without the constraint. It is worth mentioning that using just 10 support vectors effectively reduces the computational complexity and storage of the method by at least an order of magnitude. These results provide confirmatory evidence of the validity of our approach.

Table 2. Average rank of each model across all datasets.

4 Conclusion

The control of the number of support vectors in SVMs is an important problem in different scenarios, such as environments with low computational resources, very large scale machine learning, or the preliminary exploratory analysis of a dataset, where fast results are needed. In this paper we introduce a novel solver for support vector machines that allows effective control of the number of support vectors. The results show that the method is competitive with the state of the art for online learning and effectively achieves tight lower bounds on performance when enforcing drastic reductions in the number of support vectors.

Since the method is based on stochastic subgradient methods, this opens the possibility of applying a similar approach to other methods based on SGD and the use of basis functions, such as deep neural networks or gradient boosted trees.

Another interesting line of investigation is optimizing \(\theta \) with respect to the training error, allowing us to directly model the trade-off between capacity and accuracy. This would enable traversing the Pareto optimal surface between error and complexity, optimizing the number of support vectors for a given required accuracy.