Abstract
In this paper, we discuss some equivalences between two recently introduced statistical learning schemes, namely Mercer kernel methods and information theoretic methods. We show that Parzen window-based estimators for some information theoretic cost functions are also cost functions in a corresponding Mercer kernel space. The Mercer kernel is directly related to the Parzen window. Furthermore, we analyze a classification rule based on an information theoretic criterion, and show that this corresponds to a linear classifier in the kernel space. By introducing a weighted Parzen window density estimator, we also formulate the support vector machine in this information theoretic perspective.
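The central equivalence can be illustrated numerically. The sketch below (my own illustration, not the authors' code; variable names such as `V` and `mean_norm_sq` are hypothetical) uses a Gaussian Parzen window, which is also a Mercer kernel: the Parzen-window estimate of ∫p(x)²dx, whose negative logarithm estimates Renyi's quadratic entropy, coincides with the squared norm of the mean of the mapped data points in the corresponding kernel feature space, since (1/N²)Σᵢⱼ⟨φ(xᵢ),φ(xⱼ)⟩ = (1/N²)Σᵢⱼ k(xᵢ,xⱼ).

```python
import numpy as np

def gaussian_kernel(x, y, sigma):
    # Gaussian Mercer kernel; up to normalization, also a Parzen window.
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
sigma = 1.0
N = len(X)

# Gram matrix K[i, j] = k(x_i, x_j).
K = np.array([[gaussian_kernel(X[i], X[j], sigma) for j in range(N)]
              for i in range(N)])

# "Information potential": the Parzen-window estimate of integral p(x)^2 dx
# (up to a normalization constant); -log of this estimates Renyi's
# quadratic entropy.
V = K.sum() / N ** 2

# Squared norm of the feature-space mean vector m = (1/N) sum_i phi(x_i):
# ||m||^2 = (1/N^2) sum_ij <phi(x_i), phi(x_j)> = (1/N^2) sum_ij k(x_i, x_j).
mean_norm_sq = np.mean(K)

# The two quantities are identical: the information-theoretic cost is a
# cost function in the Mercer kernel feature space.
assert np.isclose(V, mean_norm_sq)
```

The identity holds by the kernel trick alone, so any positive-definite Parzen window yields the same correspondence; the choice of Gaussian window here is an illustrative assumption.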
Jenssen, R., Eltoft, T., Erdogmus, D. et al. Some Equivalences between Kernel Methods and Information Theoretic Methods. J VLSI Sign Process Syst Sign Image Video Technol 45, 49–65 (2006). https://doi.org/10.1007/s11265-006-9771-8