Efficient parallel implementation of kernel methods☆
Introduction
Kernel methods are very popular in machine learning because they produce highly competitive results in many practical tasks. They map the input space into a high-dimensional one in which inner products are computed by means of a kernel function. The most relevant techniques are Support Vector Machines (SVMs) for classification problems and Gaussian Processes (GPs) for regression.
Support Vector Machines [1] are one of the most successful kernel techniques; they aim to obtain a maximum-margin separating hyperplane. They are very popular because they automatically adjust the machine size and produce highly competitive results in many real-world problems. However, the resulting classifier is often very large, which entails a high computational cost. Many research lines have emerged to address this problem of complexity and scalability. Some works [2], [3], [4], [5] compute a full SVM and afterwards reduce the machine size by solving a preimage problem [6]. In [7], [8], an iteratively growing architecture is proposed to avoid computing a full SVM. In [8], Sparse Greedy Matrix Approximation (SGMA) is proposed to iteratively select candidates to grow a semiparametric model. Peng et al. [9] introduce a criterion for the identification of support vectors that leads to a reduced support vector set. Other works [10] focus on improving the classification complexity using decision trees.
Gaussian Processes [11] are also non-parametric methods, considered the state of the art for regression problems, relying on probabilistic Bayesian models. Unfortunately, their direct application is limited by the high training time and computational cost, which grows as O(n³) in non-sparse solutions, where n is the size of the training set. There are also some iterative greedy schemes that obtain a reduced GP. Among those, [12], [13] are based on minimizing Kullback–Leibler divergences, [14] uses a MAP criterion to select at every iteration the candidate to grow the model, and [15] selects at every iteration the element that maximizes the evidence in order to avoid overfitting.
Since run time is the main problem of kernel methods, parallelization is one of the most important techniques to accelerate them. Currently, the semiconductor industry increases processor performance by including more cores in a single chip. With the emergence of multi-core processors and programming interfaces such as OpenMP [16] for developing parallel software, many research lines on the parallelization of kernel methods have been opened.
Early works on the parallelization of SVMs propose to split the training set, train a different SVM on every data chunk and combine the results using a neural network [17], or to train a new SVM using the obtained support vectors [18]. In [19], a parallel cascade of SVMs is used. More recently, new methods have appeared, such as PSVM [20], Parallel SMO [21], [22] or the Graphics Processing Unit (GPU) Tailored Approach SVM [23]. After the advent of Big Data technologies, MapReduce-based SVMs have been used in [24], [25] to solve problems in distributed environments.
PSVM solves the Quadratic Programming problem using a parallel implementation of the Interior Point Method (IPM) [26] and an Incomplete Cholesky Factorization. Parallel SMO uses a parallel version of SMO [27], which divides the quadratic problem into a series of smaller subproblems that can be solved analytically. An implementation for GPUs that uses clustering techniques to handle sparse data sets is presented in [23]. For GPs, [28] uses domain decomposition to solve two-dimensional problems in parallel.
Since the run time of the training procedure and the complexity of the model are the main weaknesses of kernel methods, our proposal consists in developing new schemes that address both issues. To that end we benefit from two different techniques:
Semiparametric models: they solve the issue of model complexity, as shown in previous work [29], because the final machines are written as a function of a set of representatives instead of support vectors (as in SVMs) or all the data (as in GPs). These models have been shown to achieve performance similar to that of the full machines, but with lower computational cost and complexity.
Parallel computing: it solves the issue of scalability and the excessive run time of the training procedure by simultaneously using multiple computing resources.
By using these techniques we have developed three different models:
- PS-SVM: a parallel and semiparametric version of the SVM.
- P-GP: a parallel version of the GP.
- PS-GP: a parallel and semiparametric version of the GP.
This paper is organized as follows. In Section 2 we describe our algorithms. Experimental results are provided in Section 3. Finally, we present the conclusions in Section 4.
Algorithms
When developing parallel code, the two most important issues to avoid, where possible, are:
- Non-parallelizable sections of code: they set an upper bound on the achievable speedup according to Amdahl's law [30]. The run time of our non-parallel code is negligible compared to the whole run time.
- Communication between threads: to avoid potential bottlenecks we have selected OpenMP as the parallel framework, because when a subtask finishes its job another subtask can access its results directly through shared memory.
Experiments
All the algorithms have been implemented in C using OpenMP [16]. We conducted experiments to evaluate their efficiency and parallel acceleration. The experiments were executed on an HP DL160 G6 server with 48 GB of RAM and 2 Intel Xeon X5675 processors (each with 6 cores and hyper-threading technology).
To evaluate the quality of the parallelization we use the speedup metric:
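The snippet is cut before the definition; in the standard formulation, which the authors presumably follow, the speedup with p threads is the ratio of sequential to parallel run time, bounded above by Amdahl's law [30]:

```latex
S(p) = \frac{T_1}{T_p},
\qquad
S(p) \le \frac{1}{(1-f) + f/p},
```

where \(T_1\) is the sequential run time, \(T_p\) the run time with \(p\) threads, and \(f\) the fraction of the computation that can be parallelized.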
Conclusions
We have proposed several parallel algorithms for kernel methods: one method aimed at solving classification problems, called Parallel Semiparametric SVM (PS-SVM), and two methods intended for regression problems, a parallel version of a full GP (P-GP) and a parallel implementation of the SGEV algorithm for sparse GP training (PS-GP). The technique underlying these parallel implementations is based on the division of matrices into quadtrees for the parallelization of matrix inversion…
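Although the snippet is truncated, quadtree strategies of this kind typically rest on the standard 2×2 block-inversion identity via the Schur complement, sketched here (treating it as the basis of the authors' scheme is an assumption):

```latex
M = \begin{pmatrix} A & B \\ C & D \end{pmatrix},
\qquad
S = D - C A^{-1} B,
\qquad
M^{-1} = \begin{pmatrix}
A^{-1} + A^{-1} B S^{-1} C A^{-1} & -A^{-1} B S^{-1} \\[2pt]
-S^{-1} C A^{-1} & S^{-1}
\end{pmatrix}.
```

Once \(A^{-1}\) and \(S^{-1}\) are available, the four blocks of \(M^{-1}\) can be computed in parallel, and each quadrant can itself be split recursively into four sub-blocks.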
References
- Compact multiclass support vector machine, Neurocomputing (2007).
- Growing support vector classifiers with controlled complexity, Pattern Recognit. (2003).
- A sequential algorithm for sparse support vector classifiers, Pattern Recognit. (2013).
- Hierarchical linear support vector machine, Pattern Recognit. (2012).
- Developing parallel sequential minimal optimization for fast training support vector machine, Neurocomputing (2006).
- An ontology enhanced parallel SVM for scalable spam filter training, Neurocomputing (2013).
- A MapReduce based parallel SVM for large-scale predicting protein–protein interactions, Neurocomputing (2014).
- The Nature of Statistical Learning Theory (2000).
- B. Schölkopf, P. Simard, V. Vapnik, A. Smola, Improving the accuracy and speed of support vector machines, in: Advances in…
- E. Osuna, F. Girosi, Reducing the run-time complexity in support vector regression, in: Advances in Kernel Methods – Support…
- The pre-image problem in kernel methods, IEEE Trans. Neural Netw.
- Sparse on-line Gaussian processes, Neural Comput.
- Analysis of some methods for reduced rank Gaussian process regression, Switch. Learn. Feedback Syst.
- OpenMP: an industry-standard API for shared-memory programming, Comput. Sci. Eng., IEEE.
- A parallel mixture of SVMs for very large scale problems, Neural Comput.
Roberto Díaz Morales received his Telecommunications Engineering degree from the University Carlos III of Madrid (Spain) in 2006. Until 2008 he worked at Sun Microsystems in the web services area. He received the M.Sc. (Hons.) degree in multimedia and communications from the University Carlos III de Madrid in 2011 and finished his Ph.D. in 2016. His research interests are focused on machine learning.
Ángel Navia-Vázquez received his degree in Telecommunications Engineering in 1992 (Universidad de Vigo, Spain), and finished his Ph.D., also in Telecommunications Engineering, in 1997 (Universidad Politécnica de Madrid, Spain). He is now an Associate Professor at the Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Spain. His research interests are focused on new architectures and algorithms for nonlinear processing, as well as their application to multimedia processing, communications, data mining and content management. He has (co)authored 26 international refereed journal papers in these areas, several book chapters, more than 40 conference communications, and participated in more than 20 research projects. He has been an IEEE (Senior) Member since 1999.
☆ This work has been partly supported by Spanish MEC project TIN2011-24533.