Elsevier

Pattern Recognition

Volume 46, Issue 3, March 2013, Pages 795-807

Localized algorithms for multiple kernel learning

https://doi.org/10.1016/j.patcog.2012.09.002

Abstract

Instead of selecting a single kernel, multiple kernel learning (MKL) uses a weighted sum of kernels where the weight of each kernel is optimized during training. Such methods assign the same weight to a kernel over the whole input space; instead, we discuss localized multiple kernel learning (LMKL), which is composed of a kernel-based learning algorithm and a parametric gating model that assigns local weights to kernel functions. These two components are trained in a coupled manner using a two-step alternating optimization algorithm. Empirical results on benchmark classification and regression data sets validate the applicability of our approach. We see that LMKL achieves higher accuracy compared with canonical MKL on classification problems with different feature representations. LMKL can also identify the relevant parts of images using the gating model as a saliency detector in image recognition problems. In regression tasks, LMKL improves the performance significantly or reduces the model complexity by storing significantly fewer support vectors.

Highlights

► Introduces a localized multiple kernel learning framework for kernel-based algorithms.
► Generalizes the model for different gating models, kernel functions, and applications.
► Reports the results of extensive simulations on multiple real-world data sets.
► Identifies the relevant parts of images, acting as a saliency detector.
► Has inherent regularization to avoid overfitting by using only the required number of kernels.

Introduction

Support vector machine (SVM) is a discriminative classifier based on the theory of structural risk minimization [33]. Given a sample of independent and identically distributed training instances $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{N}$, where $\boldsymbol{x}_i \in \mathbb{R}^{D}$ and $y_i \in \{-1, +1\}$ is its class label, SVM finds the linear discriminant with the maximum margin in the feature space induced by the mapping function $\Phi(\cdot)$. The discriminant function is
$$f(\boldsymbol{x}) = \langle \boldsymbol{w}, \Phi(\boldsymbol{x}) \rangle + b$$
whose parameters can be learned by solving the following quadratic optimization problem:
$$\begin{aligned}
\min. \quad & \frac{1}{2}\|\boldsymbol{w}\|_2^2 + C \sum_{i=1}^{N} \xi_i \\
\text{w.r.t.} \quad & \boldsymbol{w} \in \mathbb{R}^{S}, \; \boldsymbol{\xi} \in \mathbb{R}_{+}^{N}, \; b \in \mathbb{R} \\
\text{s.t.} \quad & y_i (\langle \boldsymbol{w}, \Phi(\boldsymbol{x}_i) \rangle + b) \geq 1 - \xi_i \quad \forall i
\end{aligned}$$
where $\boldsymbol{w}$ is the vector of weight coefficients, $S$ is the dimensionality of the feature space obtained by $\Phi(\cdot)$, $C$ is a predefined positive trade-off parameter between model simplicity and classification error, $\boldsymbol{\xi}$ is the vector of slack variables, and $b$ is the bias term of the separating hyperplane. Instead of solving this optimization problem directly, the Lagrangian dual function enables us to obtain the following dual formulation:
$$\begin{aligned}
\max. \quad & \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j k(\boldsymbol{x}_i, \boldsymbol{x}_j) \\
\text{w.r.t.} \quad & \boldsymbol{\alpha} \in [0, C]^{N} \\
\text{s.t.} \quad & \sum_{i=1}^{N} \alpha_i y_i = 0
\end{aligned}$$
where $\boldsymbol{\alpha}$ is the vector of dual variables corresponding to each separation constraint and the kernel matrix obtained from $k(\boldsymbol{x}_i, \boldsymbol{x}_j) = \langle \Phi(\boldsymbol{x}_i), \Phi(\boldsymbol{x}_j) \rangle$ is positive semidefinite. Solving this, we get $\boldsymbol{w} = \sum_{i=1}^{N} \alpha_i y_i \Phi(\boldsymbol{x}_i)$ and the discriminant function can be written as
$$f(\boldsymbol{x}) = \sum_{i=1}^{N} \alpha_i y_i k(\boldsymbol{x}_i, \boldsymbol{x}) + b.$$
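
To make the dual form concrete, the following minimal sketch (in Python/NumPy; the function name and the assumption of a precomputed kernel matrix are ours, for illustration only) evaluates the resulting discriminant $f(\boldsymbol{x}) = \sum_{i=1}^{N} \alpha_i y_i k(\boldsymbol{x}_i, \boldsymbol{x}) + b$ once the dual variables have been obtained:

```python
import numpy as np

def svm_decision_function(alpha, y, K_test, b):
    """Evaluate f(x) = sum_i alpha_i * y_i * k(x_i, x) + b for a set of test points.

    alpha  : (N,) dual variables obtained by solving the SVM dual problem
    y      : (N,) training labels in {-1, +1}
    K_test : (N, M) kernel values k(x_i, x) between training and test instances
    b      : scalar bias term of the separating hyperplane
    """
    return (alpha * y) @ K_test + b
```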

There are several kernel functions successfully used in the literature, such as the linear kernel ($k_L$), the polynomial kernel ($k_P$), and the Gaussian kernel ($k_G$):
$$\begin{aligned}
k_L(\boldsymbol{x}_i, \boldsymbol{x}_j) &= \langle \boldsymbol{x}_i, \boldsymbol{x}_j \rangle \\
k_P(\boldsymbol{x}_i, \boldsymbol{x}_j) &= (\langle \boldsymbol{x}_i, \boldsymbol{x}_j \rangle + 1)^q \quad q \in \mathbb{N} \\
k_G(\boldsymbol{x}_i, \boldsymbol{x}_j) &= \exp(-\|\boldsymbol{x}_i - \boldsymbol{x}_j\|_2^2 / s^2) \quad s \in \mathbb{R}_{++}.
\end{aligned}$$
There are also kernel functions proposed for particular applications, such as natural language processing [24] and bioinformatics [31].
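
These three kernels translate directly into code; the sketch below computes full kernel matrices between two sets of instances (the vectorized squared-distance computation is an implementation choice of ours, not something taken from the paper):

```python
import numpy as np

def linear_kernel(X1, X2):
    # k_L(x_i, x_j) = <x_i, x_j>
    return X1 @ X2.T

def polynomial_kernel(X1, X2, q=2):
    # k_P(x_i, x_j) = (<x_i, x_j> + 1)^q, with q a positive integer
    return (X1 @ X2.T + 1.0) ** q

def gaussian_kernel(X1, X2, s=1.0):
    # k_G(x_i, x_j) = exp(-||x_i - x_j||^2 / s^2), with s > 0
    sq_dists = (
        np.sum(X1 ** 2, axis=1)[:, None]
        + np.sum(X2 ** 2, axis=1)[None, :]
        - 2.0 * (X1 @ X2.T)
    )
    return np.exp(-sq_dists / s ** 2)
```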

Selecting the kernel function $k(\cdot,\cdot)$ and its parameters (e.g., $q$ or $s$) is an important issue in training. Generally, a cross-validation procedure is used to choose the best performing kernel function among a set of kernel functions on a separate validation set different from the training set. In recent years, multiple kernel learning (MKL) methods have been proposed, where we use multiple kernels instead of selecting one specific kernel function and its corresponding parameters:
$$k_\eta(\boldsymbol{x}_i, \boldsymbol{x}_j) = f_\eta(\{k_m(\boldsymbol{x}_i^m, \boldsymbol{x}_j^m)\}_{m=1}^{P}) \qquad (1)$$
where the combination function $f_\eta(\cdot)$ can be a linear or a nonlinear function of the input kernels. The kernel functions, $\{k_m(\cdot,\cdot)\}_{m=1}^{P}$, take $P$ feature representations (not necessarily different) of data instances, where $\boldsymbol{x}_i = \{\boldsymbol{x}_i^m\}_{m=1}^{P}$, $\boldsymbol{x}_i^m \in \mathbb{R}^{D_m}$, and $D_m$ is the dimensionality of the corresponding feature representation.

The reasoning is similar to combining different classifiers: Instead of choosing a single kernel function and putting all our eggs in the same basket, it is better to have a set and let an algorithm do the picking or combination. There can be two uses of MKL: (i) Different kernels correspond to different notions of similarity and instead of trying to find which works best, a learning method does the picking for us, or may use a combination of them. Using a specific kernel may be a source of bias, and in allowing a learner to choose among a set of kernels, a better solution can be found. (ii) Different kernels may be using inputs coming from different representations possibly from different sources or modalities. Since these are different representations, they have different measures of similarity corresponding to different kernels. In such a case, combining kernels is one possible way to combine multiple information sources.

Since their original conception, there has been significant work on the theory and application of multiple kernel learning. Fixed rules use the combination function in (1) as a fixed function of the kernels, without any training. Once we calculate the combined kernel, we train a single kernel machine using this kernel. For example, we can obtain a valid kernel by taking the summation or the multiplication of two kernels as follows [10]:
$$\begin{aligned}
k_\eta(\boldsymbol{x}_i, \boldsymbol{x}_j) &= k_1(\boldsymbol{x}_i^1, \boldsymbol{x}_j^1) + k_2(\boldsymbol{x}_i^2, \boldsymbol{x}_j^2) \\
k_\eta(\boldsymbol{x}_i, \boldsymbol{x}_j) &= k_1(\boldsymbol{x}_i^1, \boldsymbol{x}_j^1) \, k_2(\boldsymbol{x}_i^2, \boldsymbol{x}_j^2).
\end{aligned}$$
The summation rule has been applied successfully in computational biology [27] and optical digit recognition [25] to combine two or more kernels obtained from different representations.
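
As a small sketch of these fixed rules (assuming K1 and K2 are kernel matrices precomputed on two feature representations of the same instances; the helper names are ours), the combined kernel is obtained without any training and can then be handed to a single kernel machine:

```python
import numpy as np

def fixed_sum_kernel(K1, K2):
    # k_eta(x_i, x_j) = k_1(x_i^1, x_j^1) + k_2(x_i^2, x_j^2)
    return K1 + K2

def fixed_product_kernel(K1, K2):
    # k_eta(x_i, x_j) = k_1(x_i^1, x_j^1) * k_2(x_i^2, x_j^2)
    # element-wise (Hadamard) product of the two kernel matrices
    return K1 * K2
```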

Instead of using a fixed combination function, we can have a function parameterized by a set of parameters $\Theta$ and then use a learning procedure to optimize $\Theta$ as well. The simplest case is to parameterize the sum rule as a weighted sum:
$$k_\eta(\boldsymbol{x}_i, \boldsymbol{x}_j | \Theta = \boldsymbol{\eta}) = \sum_{m=1}^{P} \eta_m k_m(\boldsymbol{x}_i^m, \boldsymbol{x}_j^m)$$
with $\eta_m \in \mathbb{R}$. Different versions of this approach differ in the way they put restrictions on the kernel weights [22], [4], [29], [19]. For example, we can use arbitrary weights (i.e., linear combination), nonnegative kernel weights (i.e., conic combination), or weights on a simplex (i.e., convex combination). A linear combination may be restrictive, and nonlinear combinations are also possible [23], [13], [8]; our proposed approach is of this type and we will discuss these in more detail later.
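
A minimal sketch of the weighted sum is given below (the function name is ours; how the weights $\eta_m$ are actually optimized, and under which constraints, is exactly what differs between the cited methods):

```python
import numpy as np

def weighted_sum_kernel(kernels, eta):
    """k_eta = sum_m eta_m * K_m over P precomputed kernel matrices.

    kernels : list of P arrays, each (N, N), one per feature representation
    eta     : (P,) kernel weights; restricting them to be nonnegative gives a
              conic combination, and additionally normalizing them to sum to
              one gives a convex combination
    """
    eta = np.asarray(eta, dtype=float)
    return sum(w * K for w, K in zip(eta, kernels))
```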

We can learn the kernel combination weights using a quality measure that gives performance estimates for the kernel matrices calculated on training data. This corresponds to a function that assigns weights to kernel functions:
$$\boldsymbol{\eta} = g_\eta(\{k_m(\boldsymbol{x}_i^m, \boldsymbol{x}_j^m)\}_{m=1}^{P}).$$
The quality measure used for determining the kernel weights could be "kernel alignment" [21], [22] or another similarity measure such as the Kullback–Leibler divergence [36]. Another possibility, inspired by ensemble and boosting methods, is to iteratively update the combined kernel by adding a new kernel as training continues [5], [9]. In a trained combiner parameterized by $\Theta$, if we assume $\Theta$ to contain random variables with a prior, we can use a Bayesian approach. For the case of a weighted sum, we can, for example, have a prior on the kernel weights [11], [12], [28]. A recent survey of multiple kernel learning algorithms is given in [18].
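
For instance, kernel alignment measures how well a kernel matrix agrees with the labels; the sketch below computes the standard alignment score between a kernel matrix and the ideal kernel $\boldsymbol{y}\boldsymbol{y}^\top$ (setting each weight proportional to its kernel's alignment would be one simple heuristic, not the procedure of any particular cited method):

```python
import numpy as np

def kernel_alignment(K, y):
    """Alignment between a kernel matrix K and the 'ideal' kernel y y^T.

    K : (N, N) kernel matrix computed on the training instances
    y : (N,) class labels in {-1, +1}
    Returns <K, yy^T>_F / sqrt(<K, K>_F * <yy^T, yy^T>_F).
    """
    Y = np.outer(y, y)
    return np.sum(K * Y) / np.sqrt(np.sum(K * K) * np.sum(Y * Y))
```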

This paper is organized as follows: We formulate our proposed nonlinear combination method, localized MKL (LMKL), with detailed mathematical derivations in Section 2. We give our experimental results in Section 3, where we compare LMKL with MKL and single kernel SVM. In Section 4, we discuss the key properties of our proposed method together with related work in the literature. We conclude in Section 5.

Section snippets

Localized multiple kernel learning

Using a fixed unweighted or weighted sum assigns the same weight to a kernel over the whole input space. Assigning different weights to a kernel in different regions of the input space may produce a better classifier. If the data has underlying local structure, different similarity measures may be suited in different regions. We propose to divide the input space into regions using a gating function and assign combination weights to kernels in a data-dependent way [13]; in the neural network
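
The snippet above is truncated here. To make the idea concrete, the following sketch implements a softmax gating model of the kind the framework uses, producing data-dependent kernel weights, together with the resulting locally combined kernel; the variable names and the use of the original feature vectors as the gating input are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax_gating(X, V, v0):
    """Data-dependent kernel weights eta_m(x) from a softmax gating model.

    X  : (N, D) gating representation of the instances
    V  : (P, D) gating weight vectors, one per kernel
    v0 : (P,) gating bias terms
    Returns an (N, P) matrix of weights whose rows sum to one.
    """
    scores = X @ V.T + v0
    scores -= scores.max(axis=1, keepdims=True)  # for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def locally_combined_kernel(kernels, eta):
    """k_eta(x_i, x_j) = sum_m eta_m(x_i) * k_m(x_i^m, x_j^m) * eta_m(x_j).

    kernels : list of P arrays, each (N, N)
    eta     : (N, P) gating outputs from softmax_gating
    """
    K_eta = np.zeros_like(kernels[0], dtype=float)
    for m, K in enumerate(kernels):
        K_eta += np.outer(eta[:, m], eta[:, m]) * K
    return K_eta
```

In the two-step alternating optimization described in the paper, the gating parameters and the dual variables of the kernel machine would be updated in turns; the sketch only shows how the locally combined kernel is formed for fixed gating parameters.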

Experiments

In this section, we report the empirical performance of LMKL for classification and regression problems on several data sets and compare LMKL with SVM, SVR, and MKL (using the linear formulation of [4]). We use our own implementations of SVM, SVR, MKL, and LMKL written in MATLAB, and the resulting optimization problems for all these methods are solved using the MOSEK optimization software [26].

Unless otherwise stated, our experimental methodology

Discussion

We discuss the key properties of the proposed method and compare it with similar MKL methods in the literature.

Conclusions

This work introduces a localized multiple kernel learning framework for kernel-based algorithms. The proposed algorithm has two main ingredients: (i) a gating model that assigns weights to kernels for a data instance, (ii) a kernel-based learning algorithm with the locally combined kernel. The training of these two components is coupled and the parameters of both components are optimized together using a two-step alternating optimization procedure. We derive the learning algorithm for three

Acknowledgments

This work was supported by the Turkish Academy of Sciences in the framework of the Young Scientist Award Program under EA-TÜBA-GEBİP/2001-1-1, the Boğaziçi University Scientific Research Project 07HA101, and the Scientific and Technological Research Council of Turkey (TÜBİTAK) under Grant EEEAG 107E222. The work of M. Gönen was supported by the Ph.D. scholarship (2211) from TÜBİTAK.


References (36)

  • M. Gönen et al., Supervised learning of local projection kernels, Neurocomputing (2010)
  • E. Alpaydın, Selective attention for handwritten digit recognition, in: Advances in Neural Information Processing...
  • E. Alpaydın, Combined 5×2 cv F test for comparing supervised classification learning algorithms, Neural Computation (1999)
  • E. Alpaydın et al., Local linear perceptrons for classification, IEEE Transactions on Neural Networks (1996)
  • F.R. Bach, G.R.G. Lanckriet, M.I. Jordan, Multiple kernel learning, conic duality, and the SMO algorithm, in:...
  • K.P. Bennett, M. Momma, M.J. Embrechts, MARK: a boosting algorithm for heterogeneous kernel models, in: Proceedings of...
  • O. Chapelle et al., Choosing multiple parameters for support vector machines, Machine Learning (2002)
  • M. Christoudias, R. Urtasun, T. Darrell, Bayesian Localized Multiple Kernel Learning, Technical Report....
  • C. Cortes, M. Mohri, A. Rostamizadeh, Learning non-linear combinations of kernels, in: Advances in Neural Information...
  • K. Crammer, J. Keshet, Y. Singer, Kernel design using boosting, in: Advances in Neural Information Processing Systems...
  • N. Cristianini et al., An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods (2000)
  • M. Girolami, S. Rogers, Hierarchic Bayesian models for kernel learning, in: Proceedings of the 22nd International...
  • M. Girolami, M. Zhong, Data integration for classification problems employing Gaussian process priors, in: Advances in...
  • M. Gönen, E. Alpaydın, Localized multiple kernel learning, in: Proceedings of the 25th International Conference on...
  • M. Gönen, E. Alpaydın, Localized multiple kernel learning for image recognition, in: NIPS Workshop on Understanding...
  • M. Gönen, E. Alpaydın, Multiple kernel machines using localized kernels, in: Supplementary Proceedings of the Fourth...
  • M. Gönen, E. Alpaydın, Localized multiple kernel regression, in: Proceedings of the 20th International Conference on...
  • M. Gönen et al., Multiple kernel learning algorithms, Journal of Machine Learning Research (2011)

Mehmet Gönen received the B.Sc. degree in industrial engineering, the M.Sc. and the Ph.D. degrees in computer engineering from Boğaziçi University, İstanbul, Turkey, in 2003, 2005, and 2010, respectively.

He was a Teaching Assistant at the Department of Computer Engineering, Boğaziçi University. He is currently doing his postdoctoral work at the Department of Information and Computer Science, Aalto University School of Science, Espoo, Finland. His research interests include support vector machines, kernel methods, Bayesian methods, optimization for machine learning, dimensionality reduction, information retrieval, and computational biology applications.

Ethem Alpaydın received his B.Sc. from the Department of Computer Engineering of Boğaziçi University in 1987 and the degree of Docteur es Sciences from Ecole Polytechnique Fédérale de Lausanne in 1990.

He did his postdoctoral work at the International Computer Science Institute, Berkeley, in 1991 and afterwards was appointed as Assistant Professor at the Department of Computer Engineering of Boğaziçi University. He was promoted to Associate Professor in 1996 and Professor in 2002 in the same department. As visiting researcher, he worked at the Department of Brain and Cognitive Sciences of MIT in 1994, the International Computer Science Institute, Berkeley, in 1997 and IDIAP, Switzerland, in 1998. He was awarded a Fulbright Senior scholarship in 1997 and received the Research Excellence Award from the Boğaziçi University Foundation in 1998 (junior level) and in 2008 (senior level), the Young Scientist Award from the Turkish Academy of Sciences in 2001 and the Scientific Encouragement Award from the Scientific and Technological Research Council of Turkey in 2002. His book Introduction to Machine Learning was published by The MIT Press in October 2004. Its German edition was published in 2008, its Chinese edition in 2009, its second edition in 2010, and its Turkish edition in 2011. He is a senior member of the IEEE, an editorial board member of The Computer Journal (Oxford University Press) and an associate editor of Pattern Recognition (Elsevier).
