Abstract
Kernel machines such as Support Vector Machines (SVMs) have achieved strong performance on pattern classification problems, mainly because the kernel function lets them exploit potentially nonlinear affinity structures in the data. Selecting an appropriate kernel function, or equivalently learning the kernel parameters accurately, therefore has a crucial impact on the classification performance of kernel machines. In this paper we consider the problem of learning a kernel matrix in a binary classification setup, where the hypothesis kernel family is represented as a convex hull of fixed basis kernels. While many existing approaches involve computationally intensive quadratic or semi-definite optimization, we propose novel kernel learning algorithms based on large margin estimation of Parzen window classifiers. The optimization is cast as instances of linear programming. This significantly reduces the complexity of kernel learning compared to existing methods, while our large-margin-based formulation provides tight upper bounds on the generalization error. We empirically demonstrate that the new kernel learning methods maintain or improve the accuracy of existing classification algorithms while significantly reducing the learning time on many real datasets, in both supervised and semi-supervised settings.
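To make the model family concrete, the following is a minimal sketch of a Parzen window classifier whose kernel is a convex combination of fixed Gaussian basis kernels (the setting the abstract describes). The bandwidths `sigmas` and weights `mu` are illustrative placeholders; this shows the hypothesis family only, not the paper's LP-based learning of the weights.

```python
# Hypothetical sketch: Parzen window classifier with a kernel that is a
# convex combination of fixed Gaussian basis kernels (mu >= 0, sum(mu) = 1).
import numpy as np

def gaussian_kernel(x, z, sigma):
    d = x - z
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

def combined_kernel(x, z, sigmas, mu):
    # k(x, z) = sum_m mu_m * k_m(x, z): a point in the convex hull of basis kernels
    return sum(m * gaussian_kernel(x, z, s) for m, s in zip(mu, sigmas))

def parzen_predict(x, X, y, sigmas, mu):
    # f(x) = mean kernel affinity to positive points minus mean affinity to
    # negative points; sign gives the class, magnitude the confidence.
    pos = np.mean([combined_kernel(x, xi, sigmas, mu) for xi in X[y == 1]])
    neg = np.mean([combined_kernel(x, xi, sigmas, mu) for xi in X[y == -1]])
    return pos - neg

X = np.array([[0.0], [0.2], [1.0], [1.2]])
y = np.array([1, 1, -1, -1])
sigmas = [0.5, 1.0]   # fixed basis kernel bandwidths (illustrative)
mu = [0.7, 0.3]       # convex combination weights; the paper learns these via LP
print(np.sign(parzen_predict(np.array([0.1]), X, y, sigmas, mu)))
```

The paper's contribution is precisely that the weights `mu` can be found by a linear program under a large-margin criterion, rather than by quadratic or semi-definite programming as in earlier multiple kernel learning work.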
Notes
The vectorization operation: a matrix is turned into a vector by concatenating its columns from left to right.
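As a concrete illustration of this column-stacking vectorization, numpy's Fortran ("F") ordering does exactly this:

```python
# vec(A): concatenate the columns of A from left to right.
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
v = A.flatten(order="F")   # columns [1, 3] then [2, 4]
print(v)                   # [1 3 2 4]
```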
Refer to Theorem 17 of [18] for further details.
The sign of f(x) determines the class label, while the magnitude indicates how confident the prediction is.
Here we assume that the kernel functions are properly normalized so that each has unit probability mass.
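The normalization assumed above means each kernel is a probability density. A quick numerical check for a 1-D Gaussian kernel (illustrative only):

```python
# A Gaussian kernel normalized to integrate to one (unit probability mass),
# checked numerically with a left Riemann sum on a fine grid.
import numpy as np

def normalized_gaussian(x, center, sigma):
    return np.exp(-(x - center) ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

xs = np.linspace(-10.0, 10.0, 100001)
dx = xs[1] - xs[0]
mass = np.sum(normalized_gaussian(xs, 0.0, 1.0)) * dx
print(round(mass, 6))  # 1.0
```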
Strictly speaking, for a fair comparison one should contrast with the learning time of the SMM run by, for instance, the SDPT3 solver. SimpleMKL is much faster than the SDPT3-solved SMM (e.g., for the Sonar dataset in Table 2, the SDPT3 solver for SMM recorded 105.72 secs for the coarse set and 490.13 secs for the fine set). Despite this advantage of SimpleMKL, we will demonstrate that our algorithms are significantly faster than SMM.
We also evaluated the performance using the Parzen window classifier (PWC), whose classification performance is slightly lower than, but not statistically different from, that of the SVM classifier. The SVM classifier, on the other hand, leads to a more compact and computationally efficient representation. Although we used a simple PWC model in our kernel learning framework, the classification performance based on the learned kernels does not degrade as long as the same classifier family is used.
We have similar results for the coarse sets.
References
Andrews S, Tsochantaridis I, Hofmann T (2003) Support vector machines for multiple-instance learning. In: Neural information processing systems
Asuncion A, Newman D (2007) UCI machine learning repository
Babich GA, Camps OI (1996) Weighted Parzen windows for pattern classification. IEEE Trans Pattern Anal Mach Intell 18(5):567–570
Bach F (2008) Exploring large feature spaces with hierarchical multiple kernel learning. In: Neural information processing systems, pp 105–112
Bach F, Lanckriet GRG, Jordan MI (2004) Multiple kernel learning, conic duality, and the SMO algorithm. In: International conference on machine learning
Belkin M, Niyogi P, Sindhwani V (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res 7:2399–2434
Bi J, Fung G, Dundar M, Rao B (2005) Semi-supervised mixture of kernels via lpboost methods. In: International conference on data mining
Bi J, Zhang T, Bennet KP (2004) Column-generation boosting methods for mixture of kernels. In: SIGKDD
Cañete A, Constanzo J, Salinas L (2008) Kernel price pattern trading. Appl Intell 29(2):152–156
Cristianini N, Shawe-Taylor J, Elisseeff A (2001) On kernel-target alignment. In: Neural information processing systems
Demiriz A, Bennett KP, Shawe-Taylor J (2002) Linear programming boosting via column generation. Mach Learn 46(1–3):225–254
Dioşan L, Rogozan A, Pecuchet JP (2010) Improving classification performance of support vector machine by genetically optimising kernel shape and hyper-parameters. Appl Intell. doi:10.1007/s10489-010-0260-1
Duda RO, Hart PE, Stork DG (2001) Pattern classification. Wiley, New York
Fumera G, Roli F (2005) A theoretical and experimental analysis of linear combiners for multiple classifier systems. IEEE Trans Pattern Anal Mach Intell 27(6):942–956
Gehler P, Nowozin S (2009) On feature combination for multiclass object classification. In: International conference on computer vision
Kondor RI, Lafferty J (2002) Diffusion kernels on graphs and other discrete structures. In: International conference on machine learning
Kwak N, Choi CH (2002) Input feature selection by mutual information based on Parzen window. IEEE Trans Pattern Anal Mach Intell 24(12):1667–1671
Lanckriet GRG, Cristianini N, Bartlett P, Ghaoui LE, Jordan MI (2004) Learning the kernel matrix with semidefinite programming. J Mach Learn Res 5:27–72
Lee LH, Wan CH, Rajkumar R, Isa D (2011) An enhanced support vector machine classification framework by using Euclidean distance function for text document categorization. Appl Intell. doi:10.1007/s10489-011-0314-z
Nanni L, Lumini A (2005) Ensemble of Parzen window classifiers for on-line signature verification. Neurocomputing 68:217–224
Rakotomamonjy A, Bach F, Canu S, Grandvalet Y (2008) Simple MKL. J Mach Learn Res 9:2491–2521
Schapire RE, Freund Y, Bartlett P, Lee WS (1998) Boosting the margin: a new explanation for the effectiveness of voting methods. Ann Stat 26(5):1651–1686
Schölkopf B, Herbrich R, Smola A (2001) A generalized representer theorem. Comput Learn Theor 2111:416–426
Schölkopf B, Smola A (2002) Learning with kernels. MIT Press, Cambridge
Smola A, Kondor R (2003) Kernels and regularization on graphs. In: Annual conference on learning theory (COLT)
Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B (2006) Large scale multiple kernel learning. J Mach Learn Res 7:1531–1565
Tsivtsivadze E, Pahikkala T, Boberg J, Salakoski T (2009) Locality kernels for sequential data and their applications to parse ranking. Appl Intell 31(1):81–88
Tutuncu RH, Toh KC, Todd MJ (2003) Solving semidefinite-quadratic-linear programs using SDPT3. Math Program, Ser A 95:189–217
Varma M, Babu BR (2009) More generality in efficient multiple kernel learning. In: International conference on machine learning
Wang J, Lu H, Plataniotis K, Lu J (2009) Gaussian kernel optimization for pattern classification. Pattern Recognit 42(7):1237–1247
Williams CKI, Rasmussen CE (1996) Gaussian processes for regression. In: Neural information processing systems
Xu Z, Jin R, King I, Lyu MR (2008) An extended level method for efficient multiple kernel learning. In: Neural information processing systems, pp 1825–1832
Yeung DY, Chow C (2002) Parzen-window network intrusion detectors. In: Proceedings of the sixteenth international conference on pattern recognition, pp 385–388
Zhang D, Chen S, Zhou ZH (2006) Learning the kernel parameters in kernel minimum distance classifier. Pattern Recognit 39(1):133–135
Zhu X, Ghahramani Z, Lafferty J (2003) Semi-supervised learning using Gaussian fields and harmonic functions. In: International conference on machine learning
Zhu X, Kandola J, Ghahramani Z, Lafferty J (2004) Nonparametric transforms of graph kernels for semi-supervised learning. In: Neural information processing systems
Cite this article
Kim, M. Accelerated max-margin multiple kernel learning. Appl Intell 38, 45–57 (2013). https://doi.org/10.1007/s10489-012-0356-x