Neurocomputing

Volume 74, Issues 12–13, June 2011, Pages 2222–2227

A new scheme to learn a kernel in regularization networks

https://doi.org/10.1016/j.neucom.2011.03.004

Abstract

In this paper, we propose a new scheme to learn a kernel function from the convex combination of finitely many given kernels in regularization networks. We show that the corresponding variational problem is convex and that, under certain conditions, it can be approximated by a semidefinite programming problem which coincides with Micchelli and Pontil's (MP's) model (Micchelli and Pontil, 2005 [10]).

Introduction

Kernel-based methods have become very popular in the machine learning community in recent years. Roughly speaking, a kernel-based method constructs a nonlinear map from the data set to a Hilbert space (the feature space) and then builds a linear algorithm in the feature space to implement its nonlinear counterpart on the data set. Kernel-based methods are very successful in many machine learning problems, such as regression, classification and dimensionality reduction. However, the results depend heavily on the kernel, so how to select a ‘good’ kernel is a critical issue for all kernel-based methods. Some progress has been made in this direction. In general, the existing works can be divided into three classes according to the task: Rayleigh coefficients play a key role in classification problems (see [7], [8], [12], [20]); for regression, we recommend Micchelli and Pontil's works (see [2], [12]); and for dimensionality reduction, the celebrated work is [17]. In these works, convex optimization techniques, especially semidefinite programming, are the basic tools (see [4], [15]). Finally, we mention that the statistical generalization analysis of the kernel learning problem is also important. Recently, Ying and Campbell developed novel generalization bounds for learning the kernel, and in particular established satisfactory excess generalization bounds and misclassification error rates for learning Gaussian kernels (see [19]).

In this paper, we develop a scheme to learn an optimal kernel from the convex combination of finitely many given kernels in regularization networks. Before describing our method, we review some basic notation.

For two given data sets $X$ and $Y$, the goal is to learn a map from $X$ to $Y$ based on finite training data $\{(x_i, y_i)\}_{i=1}^n \subset X \times Y$. In what follows, we restrict $X \subseteq \mathbb{R}^d$ and $Y \subseteq \mathbb{R}$. A kernel $K$ defined on $X$ is a symmetric function from $X \times X$ to $\mathbb{R}$ such that for any finite set $\{x_i\}_{i=1}^m$, the Gram matrix (kernel matrix) of order $m$, $G_K = (K(x_i, x_j))$, is positive semidefinite; if the kernel matrix $G_K$ is positive definite, we call $K$ a positive definite kernel. Moreover, for a given kernel $K$ there exists a unique reproducing kernel Hilbert space $\mathcal{H}_K \coloneqq \overline{\mathrm{span}}\{K(x,\cdot) : x \in X\}$ associated with $K$. The inner product of $\mathcal{H}_K$, denoted $\langle \cdot,\cdot \rangle_K$, satisfies the reproducing property $\langle f, K(x,\cdot) \rangle_K = f(x)$ for any $f \in \mathcal{H}_K$, and we write $\|\cdot\|_K$ for the norm of $\mathcal{H}_K$. For more details on kernels and reproducing kernel Hilbert spaces, see [1], [13].
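
As a concrete illustration of the Gram matrix $G_K$ and its positive semidefiniteness, the following sketch builds the kernel matrix of a Gaussian kernel on a few random inputs and checks that its eigenvalues are nonnegative. This example is not part of the paper; the kernel choice, the function names (gaussian_kernel, gram_matrix) and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(kernel, X):
    """Gram (kernel) matrix G_K = (K(x_i, x_j)) for inputs X of shape (m, d)."""
    m = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

# Toy inputs in R^d; for distinct points the Gaussian Gram matrix is positive definite.
X = np.random.default_rng(0).normal(size=(5, 2))
G = gram_matrix(gaussian_kernel, X)

print(np.allclose(G, G.T))                    # symmetry
print(np.linalg.eigvalsh(G).min() >= -1e-10)  # positive semidefiniteness (up to rounding)
```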

Classical regularization network theory formulates the regression problem as the variational problem of finding a function $f$ that minimizes the functional
$$\min_{f \in \mathcal{H}_K} Q_K(f) \coloneqq \sum_{i=1}^n (f(x_i) - y_i)^2 + \lambda \|f\|_K^2, \tag{1.1}$$
where $\lambda > 0$ is the regularization parameter. It is well known (see [5], [6], [13]) that if $f_K$ is a minimizer of (1.1), it has the form
$$f_K(x) = \sum_{i=1}^n c_i K(x_i, x), \quad x \in X, \tag{1.2}$$
for some real vector $C \coloneqq (c_1, c_2, \dots, c_n)^T$ determined by $(\lambda I + G_K) C = Y$, with $Y \coloneqq (y_1, y_2, \dots, y_n)^T$. This classical theory rests on an essential assumption: the target function from $X$ to $Y$ lies in $\mathcal{H}_K$ or can be well approximated by some element of $\mathcal{H}_K$. See [11] for the approximation ability of $\mathcal{H}_K$.
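
As a minimal numerical sketch of (1.1)–(1.2), the coefficient vector $C$ can be obtained by solving the linear system $(\lambda I + G_K) C = Y$, and the minimizer $f_K$ can then be evaluated through its kernel expansion. The kernel choice, function names (fit_regularization_network, predict) and toy data below are our illustrative assumptions, not part of the paper.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=0.5):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def fit_regularization_network(kernel, X, y, lam):
    """Solve (lam*I + G_K) C = Y for the coefficient vector C of the minimizer in (1.2)."""
    G = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(lam * np.eye(len(y)) + G, y)

def predict(kernel, X_train, C, x):
    """Evaluate f_K(x) = sum_i c_i K(x_i, x)."""
    return sum(c * kernel(xi, x) for c, xi in zip(C, X_train))

# Toy 1-D regression: fit a noisy sine curve with lambda = 0.1.
rng = np.random.default_rng(1)
X_train = rng.uniform(-1, 1, size=(20, 1))
y_train = np.sin(3 * X_train[:, 0]) + 0.1 * rng.normal(size=20)
C = fit_regularization_network(gaussian_kernel, X_train, y_train, lam=0.1)
print(predict(gaussian_kernel, X_train, C, np.array([0.2])))
```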

In this paper, a new scheme is proposed to learn an optimal kernel, namely
$$\min_{K \in \mathcal{K}} \sum_{i=1}^n (f_K(x_i) - y_i)^2, \tag{1.3}$$
where $\mathcal{K}$ is the set of convex combinations of finitely many given kernels. Our method is motivated by Micchelli and Pontil's work [10], whose idea can be stated as
$$\min_{K \in \mathcal{K}} Q_K(f_K). \tag{1.4}$$
The relation between these two models is discussed in Section 3; as we will see, under certain conditions problem (1.3) can be approximated by a semidefinite programming problem which coincides with (1.4).
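
To make the objective in (1.3) concrete, the sketch below parameterizes $K_\mu = \sum_j \mu_j K_j$ with $\mu$ on the simplex, uses the fact that on the training inputs $f_{K_\mu} = G_{K_\mu}(\lambda I + G_{K_\mu})^{-1} Y$, and evaluates the training squared error over a naive grid of mixing weights. This is only an illustration of the objective under assumed Gaussian candidate kernels and toy data; it is not the paper's method, which instead relates (1.3) to a semidefinite program.

```python
import numpy as np

def objective_13(mu, grams, y, lam):
    """Training squared error sum_i (f_K(x_i) - y_i)^2 for K = sum_j mu_j K_j.

    On the training inputs the regularization-network minimizer satisfies
    f_K = G_K (lam*I + G_K)^{-1} Y, so the objective needs only the Gram matrices.
    """
    G = sum(m * Gj for m, Gj in zip(mu, grams))
    C = np.linalg.solve(lam * np.eye(len(y)) + G, y)
    residual = G @ C - y
    return float(residual @ residual)

def rbf(width):
    """Gaussian kernel of the given width (one of the illustrative candidate kernels)."""
    return lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2.0 * width ** 2))

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(15, 1))
y = np.cos(2 * X[:, 0]) + 0.05 * rng.normal(size=15)
grams = [np.array([[k(xi, xj) for xj in X] for xi in X]) for k in (rbf(0.3), rbf(1.0))]

# Naive grid search over the simplex for p = 2 candidate kernels.
best = min((objective_13([t, 1.0 - t], grams, y, lam=0.1), t) for t in np.linspace(0, 1, 21))
print("best objective %.4f at mu_1 = %.2f" % best)
```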

This paper is organized as follows. In Section 2, we address the basic issues of the optimization problem (1.3): the existence of a solution and the convexity of the problem. In Section 3, the relation between our model and MP's is discussed, and we summarize our work in Section 4.


Learning an optimal kernel

In this part, we show that a solution of problem (1.3) exists and that the optimization problem is convex. For simplicity, we introduce the following notation:
$$\mathcal{L}_+(\mathbb{R}^n) \coloneqq \{A : A \text{ is a symmetric positive semidefinite real matrix of order } n\},$$
and we write $\mathcal{L}_{++}(\mathbb{R}^n)$ for the subset of $\mathcal{L}_+(\mathbb{R}^n)$ whose elements are positive definite matrices.

Let $K_1, \dots, K_p$ be $p$ given kernels, obtained mainly from prior information about the problem. Generally speaking, the choice of these kernels

Further discussion on the variational problem

If the given kernels are positive definite, then under a mild condition on the regularization parameter $\lambda$, problem (2.3) can be approximated by a semidefinite programming problem. We first state an important theorem.

Theorem 3.1

Let $K_i$, $i = 1, \dots, p$, be given positive definite kernels and $G_i$ the kernel matrix associated with $K_i$ with respect to the inputs $\{x_i\}_{i=1}^n$. By the definition of a positive definite kernel, the $G_i$ are symmetric positive definite matrices; we use $\alpha_i$ to denote the least eigenvalue of $G_i$, so $\alpha_i \in \mathbb{R}$ and $\alpha_i > 0$. For any 0
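
The least eigenvalues $\alpha_i$ appearing in Theorem 3.1 are straightforward to compute numerically. The sketch below does so for the Gram matrices of three Gaussian kernels of different widths; the kernels, inputs and function name (rbf_gram) are illustrative assumptions, not the paper's data.

```python
import numpy as np

def rbf_gram(X, width):
    """Gram matrix of a Gaussian kernel of the given width on inputs X."""
    return np.array([[np.exp(-np.sum((xi - xj) ** 2) / (2.0 * width ** 2)) for xj in X]
                     for xi in X])

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 2))

# alpha_i = least eigenvalue of G_i; strictly positive for positive definite kernels.
for i, width in enumerate((0.5, 1.0, 2.0), start=1):
    alpha = np.linalg.eigvalsh(rbf_gram(X, width)).min()
    print(f"alpha_{i} = {alpha:.3e}")
```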

Conclusion

In this paper, we study a very important issue: kernel selection for kernel-based methods. We propose a new scheme to learn a kernel function in regularization networks and analyze the theoretical properties of the corresponding variational problem. Moreover, we discuss the relation between our model and MP's, which helps us understand MP's model better. Some problems remain widely open. The first is how to choose the candidate kernels for combination, and as we have known,

Acknowledgment

The authors would like to thank the anonymous reviewers for their constructive suggestions and comments, which greatly improved the paper.

References (20)

  • N. Aronszajn, Theory of reproducing kernels, Transactions of the American Mathematical Society (1950)
  • A. Argyriou, C.A. Micchelli, M. Pontil, Learning convex combinations of continuously parameterized basic kernels, in:...
  • A. Argyriou, C.A. Micchelli, M. Pontil, Y. Ying, A spectral regularization framework for multi-task structure learning,...
  • S. Boyd et al., Convex Optimization (2004)
  • F. Cucker et al., Learning Theory: An Approximation Theory Viewpoint (2007)
  • T. Evgeniou et al., Regularization networks and support vector machines, Advances in Computational Mathematics (2000)
  • S.J. Kim, A. Magnani, S. Boyd, Optimal kernel selection in kernel Fisher discriminant analysis, in: Proceedings of the...
  • G.R.G. Lanckriet et al., Learning the kernel matrix with semidefinite programming, Journal of Machine Learning Research (2004)
  • C. Müller (1966)
  • C.A. Micchelli et al., Learning the kernel function via regularization, Journal of Machine Learning Research (2005)

Jie Chen received his B.Sc. degree in Information and Computational Science in 2005 and his Ph.D. degree in Computational Mathematics in 2010 from Sun Yat-Sen University, Guangzhou, China. Since July 2010 he has worked in the Department of Mathematics, Yibin University, Yibin, China. His research interests are in the areas of multiscale computing, fast singularity-preserving algorithms for linear and nonlinear integral equations, machine learning, and adaptive algorithms.

Fei Ma received his B.Sc. degree in Computer Science and Technology from Jilin University and his M.Sc. degree in Information Computation from Sun Yat-sen University, both in China. He is now with Xinhu Futures Co. Ltd., Shanghai, China.

Jian Chen received his Ph.D. degree in Computational Mathematics from Zhongshan University, Guangzhou, China, in 2010. Since July 2010 he has worked in the Department of Mathematics, Foshan University, Foshan, China. His research interests are in the areas of multiscale computing, fast algorithms for nonlinear integral and differential equations, and model reduction.
