Elsevier

Pattern Recognition

Volume 46, Issue 3, March 2013, Pages 795-807

Localized algorithms for multiple kernel learning

https://doi.org/10.1016/j.patcog.2012.09.002

Abstract

Instead of selecting a single kernel, multiple kernel learning (MKL) uses a weighted sum of kernels where the weight of each kernel is optimized during training. Such methods assign the same weight to a kernel over the whole input space; instead, we discuss localized multiple kernel learning (LMKL), which is composed of a kernel-based learning algorithm and a parametric gating model that assigns local weights to kernel functions. These two components are trained in a coupled manner using a two-step alternating optimization algorithm. Empirical results on benchmark classification and regression data sets validate the applicability of our approach. We see that LMKL achieves higher accuracy compared with canonical MKL on classification problems with different feature representations. LMKL can also identify the relevant parts of images using the gating model as a saliency detector in image recognition problems. In regression tasks, LMKL improves the performance significantly or reduces the model complexity by storing significantly fewer support vectors.

Highlights

► Introduces a localized multiple kernel learning framework for kernel-based algorithms.
► Generalizes the model for different gating models, kernel functions, and applications.
► Reports the results of extensive simulations on multiple real-world data sets.
► Identifies the relevant parts of images, acting as a saliency detector.
► Has inherent regularization to avoid overfitting by using only the required number of kernels.

Introduction

Support vector machine (SVM) is a discriminative classifier based on the theory of structural risk minimization [33]. Given a sample of independent and identically distributed training instances $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{N}$, where $\boldsymbol{x}_i \in \mathbb{R}^{D}$ and $y_i \in \{-1, +1\}$ is its class label, SVM finds the linear discriminant with the maximum margin in the feature space induced by the mapping function $\Phi(\cdot)$. The discriminant function is
$$f(\boldsymbol{x}) = \langle \boldsymbol{w}, \Phi(\boldsymbol{x}) \rangle + b$$
whose parameters can be learned by solving the following quadratic optimization problem:
$$\begin{aligned}
\min. \quad & \frac{1}{2}\|\boldsymbol{w}\|_2^2 + C \sum_{i=1}^{N} \xi_i \\
\text{w.r.t.} \quad & \boldsymbol{w} \in \mathbb{R}^{S}, \; \boldsymbol{\xi} \in \mathbb{R}_{+}^{N}, \; b \in \mathbb{R} \\
\text{s.t.} \quad & y_i (\langle \boldsymbol{w}, \Phi(\boldsymbol{x}_i) \rangle + b) \geq 1 - \xi_i \quad \forall i
\end{aligned}$$
where $\boldsymbol{w}$ is the vector of weight coefficients, $S$ is the dimensionality of the feature space obtained by $\Phi(\cdot)$, $C$ is a predefined positive trade-off parameter between model simplicity and classification error, $\boldsymbol{\xi}$ is the vector of slack variables, and $b$ is the bias term of the separating hyperplane. Instead of solving this optimization problem directly, the Lagrangian dual function enables us to obtain the following dual formulation:
$$\begin{aligned}
\max. \quad & \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j k(\boldsymbol{x}_i, \boldsymbol{x}_j) \\
\text{w.r.t.} \quad & \boldsymbol{\alpha} \in [0, C]^{N} \\
\text{s.t.} \quad & \sum_{i=1}^{N} \alpha_i y_i = 0
\end{aligned}$$
where $\boldsymbol{\alpha}$ is the vector of dual variables corresponding to each separation constraint and the kernel matrix obtained from $k(\boldsymbol{x}_i, \boldsymbol{x}_j) = \langle \Phi(\boldsymbol{x}_i), \Phi(\boldsymbol{x}_j) \rangle$ is positive semidefinite. Solving this, we get $\boldsymbol{w} = \sum_{i=1}^{N} \alpha_i y_i \Phi(\boldsymbol{x}_i)$ and the discriminant function can be written as
$$f(\boldsymbol{x}) = \sum_{i=1}^{N} \alpha_i y_i k(\boldsymbol{x}_i, \boldsymbol{x}) + b.$$
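
To make the dual form concrete, the following minimal sketch (in Python/NumPy; the function name and the assumption of a precomputed kernel matrix are ours, for illustration only) evaluates the resulting discriminant $f(\boldsymbol{x}) = \sum_{i=1}^{N} \alpha_i y_i k(\boldsymbol{x}_i, \boldsymbol{x}) + b$ once the dual variables have been obtained:

```python
import numpy as np

def svm_decision_function(alpha, y, K_test, b):
    """Evaluate f(x) = sum_i alpha_i * y_i * k(x_i, x) + b for a set of test points.

    alpha  : (N,) dual variables obtained by solving the SVM dual problem
    y      : (N,) training labels in {-1, +1}
    K_test : (N, M) kernel values k(x_i, x) between training and test instances
    b      : scalar bias term of the separating hyperplane
    """
    return (alpha * y) @ K_test + b
```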

There are several kernel functions successfully used in the literature, such as the linear kernel ($k_L$), the polynomial kernel ($k_P$), and the Gaussian kernel ($k_G$):
$$\begin{aligned}
k_L(\boldsymbol{x}_i, \boldsymbol{x}_j) &= \langle \boldsymbol{x}_i, \boldsymbol{x}_j \rangle \\
k_P(\boldsymbol{x}_i, \boldsymbol{x}_j) &= (\langle \boldsymbol{x}_i, \boldsymbol{x}_j \rangle + 1)^q \quad q \in \mathbb{N} \\
k_G(\boldsymbol{x}_i, \boldsymbol{x}_j) &= \exp(-\|\boldsymbol{x}_i - \boldsymbol{x}_j\|_2^2 / s^2) \quad s \in \mathbb{R}_{++}.
\end{aligned}$$
There are also kernel functions proposed for particular applications, such as natural language processing [24] and bioinformatics [31].
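
These three kernels translate directly into code; the sketch below computes full kernel matrices between two sets of instances (the vectorized squared-distance computation is an implementation choice of ours, not something taken from the paper):

```python
import numpy as np

def linear_kernel(X1, X2):
    # k_L(x_i, x_j) = <x_i, x_j>
    return X1 @ X2.T

def polynomial_kernel(X1, X2, q=2):
    # k_P(x_i, x_j) = (<x_i, x_j> + 1)^q, with q a positive integer
    return (X1 @ X2.T + 1.0) ** q

def gaussian_kernel(X1, X2, s=1.0):
    # k_G(x_i, x_j) = exp(-||x_i - x_j||^2 / s^2), with s > 0
    sq_dists = (
        np.sum(X1 ** 2, axis=1)[:, None]
        + np.sum(X2 ** 2, axis=1)[None, :]
        - 2.0 * (X1 @ X2.T)
    )
    return np.exp(-sq_dists / s ** 2)
```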

Selecting the kernel function $k(\cdot,\cdot)$ and its parameters (e.g., $q$ or $s$) is an important issue in training. Generally, a cross-validation procedure is used to choose the best performing kernel function among a set of kernel functions on a separate validation set different from the training set. In recent years, multiple kernel learning (MKL) methods have been proposed, where we use multiple kernels instead of selecting one specific kernel function and its corresponding parameters:
$$k_\eta(\boldsymbol{x}_i, \boldsymbol{x}_j) = f_\eta(\{k_m(\boldsymbol{x}_i^m, \boldsymbol{x}_j^m)\}_{m=1}^{P}) \qquad (1)$$
where the combination function $f_\eta(\cdot)$ can be a linear or a nonlinear function of the input kernels. The kernel functions, $\{k_m(\cdot,\cdot)\}_{m=1}^{P}$, take $P$ feature representations (not necessarily different) of data instances, where $\boldsymbol{x}_i = \{\boldsymbol{x}_i^m\}_{m=1}^{P}$, $\boldsymbol{x}_i^m \in \mathbb{R}^{D_m}$, and $D_m$ is the dimensionality of the corresponding feature representation.

The reasoning is similar to combining different classifiers: Instead of choosing a single kernel function and putting all our eggs in the same basket, it is better to have a set and let an algorithm do the picking or combination. There can be two uses of MKL: (i) Different kernels correspond to different notions of similarity and instead of trying to find which works best, a learning method does the picking for us, or may use a combination of them. Using a specific kernel may be a source of bias, and in allowing a learner to choose among a set of kernels, a better solution can be found. (ii) Different kernels may be using inputs coming from different representations possibly from different sources or modalities. Since these are different representations, they have different measures of similarity corresponding to different kernels. In such a case, combining kernels is one possible way to combine multiple information sources.

Since their original conception, there has been significant work on the theory and application of multiple kernel learning. Fixed rules use the combination function in (1) as a fixed function of the kernels, without any training. Once we calculate the combined kernel, we train a single kernel machine using this kernel. For example, we can obtain a valid kernel by taking the summation or the multiplication of two kernels as follows [10]:
$$\begin{aligned}
k_\eta(\boldsymbol{x}_i, \boldsymbol{x}_j) &= k_1(\boldsymbol{x}_i^1, \boldsymbol{x}_j^1) + k_2(\boldsymbol{x}_i^2, \boldsymbol{x}_j^2) \\
k_\eta(\boldsymbol{x}_i, \boldsymbol{x}_j) &= k_1(\boldsymbol{x}_i^1, \boldsymbol{x}_j^1) \, k_2(\boldsymbol{x}_i^2, \boldsymbol{x}_j^2).
\end{aligned}$$
The summation rule has been applied successfully in computational biology [27] and optical digit recognition [25] to combine two or more kernels obtained from different representations.
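
As a small sketch of these fixed rules (assuming K1 and K2 are kernel matrices precomputed on two feature representations of the same instances; the helper names are ours), the combined kernel is obtained without any training and can then be handed to a single kernel machine:

```python
import numpy as np

def fixed_sum_kernel(K1, K2):
    # k_eta(x_i, x_j) = k_1(x_i^1, x_j^1) + k_2(x_i^2, x_j^2)
    return K1 + K2

def fixed_product_kernel(K1, K2):
    # k_eta(x_i, x_j) = k_1(x_i^1, x_j^1) * k_2(x_i^2, x_j^2)
    # element-wise (Hadamard) product of the two kernel matrices
    return K1 * K2
```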

Instead of using a fixed combination function, we can have a function parameterized by a set of parameters $\Theta$ and then use a learning procedure to optimize $\Theta$ as well. The simplest case is to parameterize the sum rule as a weighted sum:
$$k_\eta(\boldsymbol{x}_i, \boldsymbol{x}_j | \Theta = \boldsymbol{\eta}) = \sum_{m=1}^{P} \eta_m k_m(\boldsymbol{x}_i^m, \boldsymbol{x}_j^m)$$
with $\eta_m \in \mathbb{R}$. Different versions of this approach differ in the way they put restrictions on the kernel weights [22], [4], [29], [19]. For example, we can use arbitrary weights (i.e., linear combination), nonnegative kernel weights (i.e., conic combination), or weights on a simplex (i.e., convex combination). A linear combination may be restrictive, and nonlinear combinations are also possible [23], [13], [8]; our proposed approach is of this type and we will discuss these in more detail later.
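
A minimal sketch of the weighted sum is given below (the function name is ours; how the weights $\eta_m$ are actually optimized, and under which constraints, is exactly what differs between the cited methods):

```python
import numpy as np

def weighted_sum_kernel(kernels, eta):
    """k_eta = sum_m eta_m * K_m over P precomputed kernel matrices.

    kernels : list of P arrays, each (N, N), one per feature representation
    eta     : (P,) kernel weights; restricting them to be nonnegative gives a
              conic combination, and additionally normalizing them to sum to
              one gives a convex combination
    """
    eta = np.asarray(eta, dtype=float)
    return sum(w * K for w, K in zip(eta, kernels))
```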

We can learn the kernel combination weights using a quality measure that gives performance estimates for the kernel matrices calculated on training data. This corresponds to a function that assigns weights to kernel functions:
$$\boldsymbol{\eta} = g_\eta(\{k_m(\boldsymbol{x}_i^m, \boldsymbol{x}_j^m)\}_{m=1}^{P}).$$
The quality measure used for determining the kernel weights could be "kernel alignment" [21], [22] or another similarity measure such as the Kullback–Leibler divergence [36]. Another possibility, inspired by ensemble and boosting methods, is to iteratively update the combined kernel by adding a new kernel as training continues [5], [9]. In a trained combiner parameterized by $\Theta$, if we assume $\Theta$ to contain random variables with a prior, we can use a Bayesian approach. For the case of a weighted sum, we can, for example, have a prior on the kernel weights [11], [12], [28]. A recent survey of multiple kernel learning algorithms is given in [18].
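
For instance, kernel alignment measures how well a kernel matrix agrees with the labels; the sketch below computes the standard alignment score between a kernel matrix and the ideal kernel $\boldsymbol{y}\boldsymbol{y}^\top$ (setting each weight proportional to its kernel's alignment would be one simple heuristic, not the procedure of any particular cited method):

```python
import numpy as np

def kernel_alignment(K, y):
    """Alignment between a kernel matrix K and the 'ideal' kernel y y^T.

    K : (N, N) kernel matrix computed on the training instances
    y : (N,) class labels in {-1, +1}
    Returns <K, yy^T>_F / sqrt(<K, K>_F * <yy^T, yy^T>_F).
    """
    Y = np.outer(y, y)
    return np.sum(K * Y) / np.sqrt(np.sum(K * K) * np.sum(Y * Y))
```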

This paper is organized as follows: We formulate our proposed nonlinear combination method, localized MKL (LMKL), with detailed mathematical derivations in Section 2. We give our experimental results in Section 3, where we compare LMKL with MKL and single kernel SVM. In Section 4, we discuss the key properties of our proposed method together with related work in the literature. We conclude in Section 5.

Section snippets

Localized multiple kernel learning

Using a fixed unweighted or weighted sum assigns the same weight to a kernel over the whole input space. Assigning different weights to a kernel in different regions of the input space may produce a better classifier. If the data has underlying local structure, different similarity measures may be suited in different regions. We propose to divide the input space into regions using a gating function and assign combination weights to kernels in a data-dependent way [13]; in the neural network
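
The snippet above is truncated here. To make the idea concrete, the following sketch implements a softmax gating model of the kind the framework uses, producing data-dependent kernel weights, together with the resulting locally combined kernel; the variable names and the use of the original feature vectors as the gating input are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax_gating(X, V, v0):
    """Data-dependent kernel weights eta_m(x) from a softmax gating model.

    X  : (N, D) gating representation of the instances
    V  : (P, D) gating weight vectors, one per kernel
    v0 : (P,) gating bias terms
    Returns an (N, P) matrix of weights whose rows sum to one.
    """
    scores = X @ V.T + v0
    scores -= scores.max(axis=1, keepdims=True)  # for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def locally_combined_kernel(kernels, eta):
    """k_eta(x_i, x_j) = sum_m eta_m(x_i) * k_m(x_i^m, x_j^m) * eta_m(x_j).

    kernels : list of P arrays, each (N, N)
    eta     : (N, P) gating outputs from softmax_gating
    """
    K_eta = np.zeros_like(kernels[0], dtype=float)
    for m, K in enumerate(kernels):
        K_eta += np.outer(eta[:, m], eta[:, m]) * K
    return K_eta
```

In the two-step alternating optimization described in the paper, the gating parameters and the dual variables of the kernel machine would be updated in turns; the sketch only shows how the locally combined kernel is formed for fixed gating parameters.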

Experiments

In this section, we report the empirical performance of LMKL for classification and regression problems on several data sets and compare LMKL with SVM, SVR, and MKL (using the linear formulation of [4]). We use our own implementations of SVM, SVR, MKL, and LMKL written in MATLAB, and the resulting optimization problems for all these methods are solved using the MOSEK optimization software [26].

Unless otherwise stated, our experimental methodology

Discussion

We discuss the key properties of the proposed method and compare it with similar MKL methods in the literature.

Conclusions

This work introduces a localized multiple kernel learning framework for kernel-based algorithms. The proposed algorithm has two main ingredients: (i) a gating model that assigns weights to kernels for a data instance, (ii) a kernel-based learning algorithm with the locally combined kernel. The training of these two components is coupled and the parameters of both components are optimized together using a two-step alternating optimization procedure. We derive the learning algorithm for three

Acknowledgments

This work was supported by the Turkish Academy of Sciences in the framework of the Young Scientist Award Program under EA-TÜBA-GEBİP/2001-1-1, the Boğaziçi University Scientific Research Project 07HA101, and the Scientific and Technological Research Council of Turkey (TÜBİTAK) under Grant EEEAG 107E222. The work of M. Gönen was supported by the Ph.D. scholarship (2211) from TÜBİTAK.


References (36)

  • M. Gönen et al., Supervised learning of local projection kernels, Neurocomputing (2010)
  • E. Alpaydın, Selective attention for handwritten digit recognition, in: Advances in Neural Information Processing...
  • E. Alpaydın, Combined 5×2 cv F test for comparing supervised classification learning algorithms, Neural Computation (1999)
  • E. Alpaydın et al., Local linear perceptrons for classification, IEEE Transactions on Neural Networks (1996)
  • F.R. Bach, G.R.G. Lanckriet, M.I. Jordan, Multiple kernel learning, conic duality, and the SMO algorithm, in:...
  • K.P. Bennett, M. Momma, M.J. Embrechts, MARK: a boosting algorithm for heterogeneous kernel models, in: Proceedings of...
  • O. Chapelle et al., Choosing multiple parameters for support vector machines, Machine Learning (2002)
  • M. Christoudias, R. Urtasun, T. Darrell, Bayesian Localized Multiple Kernel Learning, Technical Report....
  • C. Cortes, M. Mohri, A. Rostamizadeh, Learning non-linear combinations of kernels, in: Advances in Neural Information...
  • K. Crammer, J. Keshet, Y. Singer, Kernel design using boosting, in: Advances in Neural Information Processing Systems...
  • N. Cristianini et al., An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods (2000)
  • M. Girolami, S. Rogers, Hierarchic Bayesian models for kernel learning, in: Proceedings of the 22nd International...
  • M. Girolami, M. Zhong, Data integration for classification problems employing Gaussian process priors, in: Advances in...
  • M. Gönen, E. Alpaydın, Localized multiple kernel learning, in: Proceedings of the 25th International Conference on...
  • M. Gönen, E. Alpaydın, Localized multiple kernel learning for image recognition, in: NIPS Workshop on Understanding...
  • M. Gönen, E. Alpaydın, Multiple kernel machines using localized kernels, in: Supplementary Proceedings of the Fourth...
  • M. Gönen, E. Alpaydın, Localized multiple kernel regression, in: Proceedings of the 20th International Conference on...
  • M. Gönen et al., Multiple kernel learning algorithms, Journal of Machine Learning Research (2011)

Mehmet Gönen received the B.Sc. degree in industrial engineering, the M.Sc. and the Ph.D. degrees in computer engineering from Boğaziçi University, İstanbul, Turkey, in 2003, 2005, and 2010, respectively.

He was a Teaching Assistant at the Department of Computer Engineering, Boğaziçi University. He is currently doing his postdoctoral work at the Department of Information and Computer Science, Aalto University School of Science, Espoo, Finland. His research interests include support vector machines, kernel methods, Bayesian methods, optimization for machine learning, dimensionality reduction, information retrieval, and computational biology applications.

Ethem Alpaydın received his B.Sc. from the Department of Computer Engineering of Boğaziçi University in 1987 and the degree of Docteur es Sciences from Ecole Polytechnique Fédérale de Lausanne in 1990.

He did his postdoctoral work at the International Computer Science Institute, Berkeley, in 1991 and afterwards was appointed as Assistant Professor at the Department of Computer Engineering of Boğaziçi University. He was promoted to Associate Professor in 1996 and Professor in 2002 in the same department. As visiting researcher, he worked at the Department of Brain and Cognitive Sciences of MIT in 1994, the International Computer Science Institute, Berkeley, in 1997 and IDIAP, Switzerland, in 1998. He was awarded a Fulbright Senior scholarship in 1997 and received the Research Excellence Award from the Boğaziçi University Foundation in 1998 (junior level) and in 2008 (senior level), the Young Scientist Award from the Turkish Academy of Sciences in 2001 and the Scientific Encouragement Award from the Scientific and Technological Research Council of Turkey in 2002. His book Introduction to Machine Learning was published by The MIT Press in October 2004. Its German edition was published in 2008, its Chinese edition in 2009, its second edition in 2010, and its Turkish edition in 2011. He is a senior member of the IEEE, an editorial board member of The Computer Journal (Oxford University Press) and an associate editor of Pattern Recognition (Elsevier).
