Neurocomputing

Volume 74, Issues 12–13, June 2011, Pages 2222–2227

A new scheme to learn a kernel in regularization networks

https://doi.org/10.1016/j.neucom.2011.03.004

Abstract

In this paper, we propose a new scheme to learn a kernel function from the convex combination of finitely many given kernels in regularization networks. We show that the corresponding variational problem is convex and that, under certain conditions, it can be approximated by a semidefinite programming problem which coincides with Micchelli and Pontil's (MP's) model (Micchelli and Pontil, 2005 [10]).

Introduction

Kernel-based methods have become very popular in the machine learning community in recent years. Roughly speaking, a kernel-based method constructs a nonlinear map from the data set to a Hilbert space (the feature space) and then builds a linear algorithm in the feature space to implement its nonlinear counterpart on the data set. Kernel-based methods are very successful in many machine learning problems, such as regression, classification and dimensionality reduction. However, the results depend heavily on the kernel, so how to select a ‘good’ kernel is a critical issue for all kernel-based methods. Some progress has been made in this direction. In general, the existing works can be divided into three classes according to the task: Rayleigh coefficients play a key role in classification problems (see [7], [8], [12], [20]); for regression, we recommend Micchelli and Pontil's works (see [2], [12]); and for dimensionality reduction, the celebrated work is [17]. In these works, convex optimization techniques, especially semidefinite programming, are the basic tools (see [4], [15]). Finally, we mention that the statistical generalization analysis of the kernel learning problem is also important. Recently, Ying and Campbell developed novel generalization bounds for learning the kernel, and in particular established satisfactory excess generalization bounds and misclassification error rates for learning Gaussian kernels (see [19]).

In this paper, we develop a scheme to learn an optimal kernel from the convex combination of finitely many given kernels in regularization networks. Before describing our method, we review some basic notation.

For two given data sets $X$ and $Y$, the goal is to learn a map from $X$ to $Y$ based on finite training data $\{(x_i, y_i)\}_{i=1}^n \subset X \times Y$. In what follows, we restrict $X \subseteq \mathbb{R}^d$ and $Y \subseteq \mathbb{R}$. A kernel $K$ defined on $X$ is a symmetric function from $X \times X$ to $\mathbb{R}$ such that for any finite set $\{x_i\}_{i=1}^m$, the Gram matrix (kernel matrix) of order $m$, $G_K = (K(x_i, x_j))$, is positive semidefinite; if the kernel matrix $G_K$ is positive definite, we call $K$ a positive definite kernel. Moreover, for a given kernel $K$ there exists a unique reproducing kernel Hilbert space $\mathcal{H}_K \coloneqq \overline{\mathrm{span}}\{K(x,\cdot) : x \in X\}$ associated with $K$. The inner product of $\mathcal{H}_K$, denoted $\langle \cdot,\cdot \rangle_K$, satisfies the reproducing property $\langle f, K(x,\cdot) \rangle_K = f(x)$ for any $f \in \mathcal{H}_K$, and we write $\|\cdot\|_K$ for the norm of $\mathcal{H}_K$. For more details on kernels and reproducing kernel Hilbert spaces, see [1], [13].
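
As a concrete illustration of the Gram matrix $G_K$ and its positive semidefiniteness, the following sketch builds the kernel matrix of a Gaussian kernel on a few random inputs and checks that its eigenvalues are nonnegative. This example is not part of the paper; the kernel choice, the function names (gaussian_kernel, gram_matrix) and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(kernel, X):
    """Gram (kernel) matrix G_K = (K(x_i, x_j)) for inputs X of shape (m, d)."""
    m = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

# Toy inputs in R^d; for distinct points the Gaussian Gram matrix is positive definite.
X = np.random.default_rng(0).normal(size=(5, 2))
G = gram_matrix(gaussian_kernel, X)

print(np.allclose(G, G.T))                    # symmetry
print(np.linalg.eigvalsh(G).min() >= -1e-10)  # positive semidefiniteness (up to rounding)
```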

Classical regularization network theory formulates the regression problem as the variational problem of finding a function $f$ that minimizes the functional
$$\min_{f \in \mathcal{H}_K} Q_K(f) \coloneqq \sum_{i=1}^n (f(x_i) - y_i)^2 + \lambda \|f\|_K^2, \tag{1.1}$$
where $\lambda > 0$ is the regularization parameter. It is well known (see [5], [6], [13]) that if $f_K$ is a minimizer of (1.1), it has the form
$$f_K(x) = \sum_{i=1}^n c_i K(x_i, x), \quad x \in X, \tag{1.2}$$
for some real vector $C \coloneqq (c_1, c_2, \dots, c_n)^T$ determined by $(\lambda I + G_K) C = Y$, with $Y \coloneqq (y_1, y_2, \dots, y_n)^T$. This classical theory rests on an essential assumption: the target function from $X$ to $Y$ lies in $\mathcal{H}_K$ or can be well approximated by some element of $\mathcal{H}_K$. See [11] for the approximation ability of $\mathcal{H}_K$.
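
As a minimal numerical sketch of (1.1)–(1.2), the coefficient vector $C$ can be obtained by solving the linear system $(\lambda I + G_K) C = Y$, and the minimizer $f_K$ can then be evaluated through its kernel expansion. The kernel choice, function names (fit_regularization_network, predict) and toy data below are our illustrative assumptions, not part of the paper.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=0.5):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def fit_regularization_network(kernel, X, y, lam):
    """Solve (lam*I + G_K) C = Y for the coefficient vector C of the minimizer in (1.2)."""
    G = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(lam * np.eye(len(y)) + G, y)

def predict(kernel, X_train, C, x):
    """Evaluate f_K(x) = sum_i c_i K(x_i, x)."""
    return sum(c * kernel(xi, x) for c, xi in zip(C, X_train))

# Toy 1-D regression: fit a noisy sine curve with lambda = 0.1.
rng = np.random.default_rng(1)
X_train = rng.uniform(-1, 1, size=(20, 1))
y_train = np.sin(3 * X_train[:, 0]) + 0.1 * rng.normal(size=20)
C = fit_regularization_network(gaussian_kernel, X_train, y_train, lam=0.1)
print(predict(gaussian_kernel, X_train, C, np.array([0.2])))
```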

In this paper, a new scheme is proposed to learn an optimal kernel, namely
$$\min_{K \in \mathcal{K}} \sum_{i=1}^n (f_K(x_i) - y_i)^2, \tag{1.3}$$
where $\mathcal{K}$ is the set of convex combinations of finitely many given kernels. Our method is motivated by Micchelli and Pontil's work [10], whose idea can be stated as
$$\min_{K \in \mathcal{K}} Q_K(f_K). \tag{1.4}$$
The relation between these two models is discussed in Section 3; as we will see, under certain conditions problem (1.3) can be approximated by a semidefinite programming problem which coincides with (1.4).
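
To make the objective in (1.3) concrete, the sketch below parameterizes $K_\mu = \sum_j \mu_j K_j$ with $\mu$ on the simplex, uses the fact that on the training inputs $f_{K_\mu} = G_{K_\mu}(\lambda I + G_{K_\mu})^{-1} Y$, and evaluates the training squared error over a naive grid of mixing weights. This is only an illustration of the objective under assumed Gaussian candidate kernels and toy data; it is not the paper's method, which instead relates (1.3) to a semidefinite program.

```python
import numpy as np

def objective_13(mu, grams, y, lam):
    """Training squared error sum_i (f_K(x_i) - y_i)^2 for K = sum_j mu_j K_j.

    On the training inputs the regularization-network minimizer satisfies
    f_K = G_K (lam*I + G_K)^{-1} Y, so the objective needs only the Gram matrices.
    """
    G = sum(m * Gj for m, Gj in zip(mu, grams))
    C = np.linalg.solve(lam * np.eye(len(y)) + G, y)
    residual = G @ C - y
    return float(residual @ residual)

def rbf(width):
    """Gaussian kernel of the given width (one of the illustrative candidate kernels)."""
    return lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2.0 * width ** 2))

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(15, 1))
y = np.cos(2 * X[:, 0]) + 0.05 * rng.normal(size=15)
grams = [np.array([[k(xi, xj) for xj in X] for xi in X]) for k in (rbf(0.3), rbf(1.0))]

# Naive grid search over the simplex for p = 2 candidate kernels.
best = min((objective_13([t, 1.0 - t], grams, y, lam=0.1), t) for t in np.linspace(0, 1, 21))
print("best objective %.4f at mu_1 = %.2f" % best)
```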

This paper is organized as follows. In Section 2, we address the basic issues of the optimization problem (1.3): the existence of a solution and the convexity of the problem. In Section 3, the relation between our model and MP's is discussed, and we summarize our work in Section 4.


Learning an optimal kernel

In this part, we show that a solution of problem (1.3) exists and that the optimization problem is convex. For simplicity, we introduce the following notation:
$$\mathcal{L}_+(\mathbb{R}^n) \coloneqq \{A : A \text{ is a symmetric positive semidefinite real matrix of order } n\},$$
and we write $\mathcal{L}_{++}(\mathbb{R}^n)$ for the subset of $\mathcal{L}_+(\mathbb{R}^n)$ whose elements are positive definite matrices.

Let $K_1, \dots, K_p$ be $p$ given kernels, obtained mainly from prior information about the problem. Generally speaking, the choice of these kernels

Further discussion on the variational problem

If the given kernels are positive definite, then under a mild condition on the regularization parameter $\lambda$, problem (2.3) can be approximated by a semidefinite programming problem. We first state an important theorem.

Theorem 3.1

Let $K_i$, $i = 1, \dots, p$, be given positive definite kernels and $G_i$ the kernel matrix associated with $K_i$ with respect to the inputs $\{x_i\}_{i=1}^n$. By the definition of a positive definite kernel, the $G_i$ are symmetric positive definite matrices; we use $\alpha_i$ to denote the least eigenvalue of $G_i$, so $\alpha_i \in \mathbb{R}$ and $\alpha_i > 0$. For any 0
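
The least eigenvalues $\alpha_i$ appearing in Theorem 3.1 are straightforward to compute numerically. The sketch below does so for the Gram matrices of three Gaussian kernels of different widths; the kernels, inputs and function name (rbf_gram) are illustrative assumptions, not the paper's data.

```python
import numpy as np

def rbf_gram(X, width):
    """Gram matrix of a Gaussian kernel of the given width on inputs X."""
    return np.array([[np.exp(-np.sum((xi - xj) ** 2) / (2.0 * width ** 2)) for xj in X]
                     for xi in X])

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 2))

# alpha_i = least eigenvalue of G_i; strictly positive for positive definite kernels.
for i, width in enumerate((0.5, 1.0, 2.0), start=1):
    alpha = np.linalg.eigvalsh(rbf_gram(X, width)).min()
    print(f"alpha_{i} = {alpha:.3e}")
```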

Conclusion

In this paper, we study a very important issue: kernel selection for kernel-based methods. We propose a new scheme to learn a kernel function in regularization networks and analyze the theoretical properties of the corresponding variational problem. Moreover, we discuss the relation between our model and MP's, which helps us understand MP's model better. Some problems remain widely open. The first is how to choose the candidate kernels for combination, and as we have known,

Acknowledgment

The authors would like to thank the anonymous reviewers for their constructive suggestions and comments, which greatly improved the paper.

References (20)

  • N. Aronszajn, Theory of reproducing kernels, Transactions of the American Mathematical Society (1950)
  • A. Argyriou, C.A. Micchelli, M. Pontil, Learning convex combinations of continuously parameterized basic kernels, in:...
  • A. Argyriou, C.A. Micchelli, M. Pontil, Y. Ying, A spectral regularization framework for multi-task structure learning,...
  • S. Boyd et al., Convex Optimization (2004)
  • F. Cucker et al., Learning Theory: An Approximation Theory Viewpoint (2007)
  • T. Evgeniou et al., Regularization networks and support vector machines, Advances in Computational Mathematics (2000)
  • S.J. Kim, A. Magnani, S. Boyd, Optimal kernel selection in kernel Fisher discriminant analysis, in: Proceedings of the...
  • G.R.G. Lanckriet et al., Learning the kernel matrix with semidefinite programming, Journal of Machine Learning Research (2004)
  • C. Müller (1966)
  • C.A. Micchelli et al., Learning the kernel function via regularization, Journal of Machine Learning Research (2005)

Jie Chen received his B.Sc. degree in Information and Computational Science in 2005 and his Ph.D. degree in Computational Mathematics in 2010 from Sun Yat-Sen University, Guangzhou, China. Since July 2010 he has worked in the Department of Mathematics, Yibin University, Yibin, China. His research interests are in the areas of multiscale computing, fast singularity-preserving algorithms for linear and nonlinear integral equations, machine learning, and adaptive algorithms.

Fei Ma received his B.Sc. degree in Computer Science and Technology from Jilin University and his M.Sc. degree in Information Computation from Sun Yat-sen University, both in China. He is now with Xinhu Futures Co. Ltd., Shanghai, China.

Jian Chen received his Ph.D. degree in Computational Mathematics from Zhongshan University, Guangzhou, China, in 2010. Since July 2010 he has worked in the Department of Mathematics, Foshan University, Foshan, China. His research interests are in the areas of multiscale computing, fast algorithms for nonlinear integral and differential equations, and model reduction.
