The Cauchy–Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels
Introduction
Recently, a new scheme for statistically based machine learning has emerged, coined information theoretic learning (ITL) [1]. The starting point is a data set that globally conveys information about a real-world event. The goal in ITL is to capture, or learn, this information in the form of the parameters of an adaptive system. This is done using information theoretic cost functions as learning criteria. As opposed to the traditional mean squared error criterion, information theoretic cost functions take into account statistical dependencies beyond correlations. This is important in many problems in machine learning, such as blind source separation and independent component analysis, blind equalization and deconvolution, subspace projections, dimensionality reduction and manifold learning, feature extraction, classification and clustering.
Shannon's entropy [2] is the most well-known information theoretic quantity. Renyi [3] proposed another important entropy measure, which satisfies all but one of the axioms laid down by Shannon for an information measure. It contains the Shannon entropy as a special case. We will therefore refer to Renyi's entropy as an information theoretic measure.
These information theoretic cost functions are expressed in terms of probability density functions (pdfs). Hence, ITL implicitly requires pdfs to be estimated, in order to evaluate the information theoretic criterion. Since it is oftentimes desirable to make as few assumptions as possible about the structure of the pdfs in question, Principe et al. [1] argued that Parzen windowing is a convenient density estimation technique. In that paper, it was also proposed to use the Renyi quadratic entropy as an information theoretic cost function, because it can be estimated using Parzen windowing without any approximations besides the Parzen windowing itself. This is not the case for the Shannon entropy. ITL based on the Renyi quadratic entropy and Parzen windowing has been applied with great success in several supervised and unsupervised learning schemes, see for example [4], [5], [6], [7], [8], [9], [10], [11], [12].
Other cost functions based on the Renyi entropy were also proposed, for the same reasons. One important such quantity is the so-called Cauchy–Schwarz (CS) pdf divergence. The CS divergence is a measure of the “distance” between two pdfs, p(x) and q(x). It can be considered an approximation to the well-known Kullback–Leibler divergence [13]. The CS divergence satisfies for example the additivity property of an information measure [14], considered by Renyi to be the most important such property. It is therefore referred to as an information theoretic measure. The CS divergence measure is given by

D_CS(p, q) = −log [ ∫ p(x)q(x) dx / √( ∫ p²(x) dx ∫ q²(x) dx ) ].

This is a symmetric measure, such that D_CS(p, q) = D_CS(q, p) ≥ 0, where the minimum D_CS(p, q) = 0 is obtained if and only if p(x) = q(x).
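For illustration, the definition above may be checked numerically by discretizing the densities on a grid and approximating the inner products by Riemann sums. The following sketch uses two Gaussian test densities purely as an illustrative assumption; it is not part of the original derivation.

```python
import numpy as np

def cs_divergence(p, q, dx):
    """CS divergence between two densities sampled on a common grid:
    D_CS(p, q) = -log( <p, q> / sqrt(<p, p> <q, q>) ),
    with inner products approximated by Riemann sums."""
    cross = np.sum(p * q) * dx
    norm = np.sqrt(np.sum(p * p) * dx * np.sum(q * q) * dx)
    return -np.log(cross / norm)

def gauss(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
p = gauss(x, 0.0, 1.0)
q = gauss(x, 2.0, 1.0)

print(cs_divergence(p, p, dx))  # identical pdfs -> 0
print(cs_divergence(p, q, dx))  # positive, grows with the mean separation
```

For two unit-variance Gaussians the integrals can be evaluated in closed form, giving D_CS = (μ₁ − μ₂)²/(4σ²) = 1.0 in this example, which the discretized computation reproduces.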
This paper provides a tutorial level discussion of some interesting properties of the CS divergence. It turns out that when Parzen windowing is used to estimate the CS divergence, the resulting cost function can be interpreted both in terms of graph theory and Mercer kernel and spectral theory. Fig. 1 illustrates these connections, which we will discuss further in the following sections. Hence, these links indicate that graph theoretic measures and Mercer kernel-based measures are closely related to certain information theoretic measures and non-parametric density estimation. Graph theory and Mercer kernel-based theory have been important parts of machine learning research in recent years.
Graph theory [15] has been used for decades in various scientific fields for many purposes. In the last decade, it has also been introduced to the field of computer vision and machine learning, by optimizing the so-called graph cut [16]. The graph cut provides a measure of the cost of partitioning a graph into two subgraphs.
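The cut of a partition is simply the total weight of the edges crossing between the two subgraphs. As a minimal sketch (the toy affinity matrix below is an illustrative assumption), it can be computed directly from a symmetric affinity matrix:

```python
import numpy as np

def graph_cut(W, mask):
    """Cut value between the node set selected by `mask` and its complement:
    cut(A, B) = sum over i in A, j in B of w_ij, for symmetric affinities W."""
    A = np.where(mask)[0]
    B = np.where(~mask)[0]
    return W[np.ix_(A, B)].sum()

# Toy graph: two tight clusters {0, 1, 2} and {3, 4} joined by one weak edge.
W = np.zeros((5, 5))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (3, 4, 1.0), (2, 3, 0.1)]:
    W[i, j] = W[j, i] = w

mask = np.array([True, True, True, False, False])
print(graph_cut(W, mask))  # 0.1: only the weak inter-cluster edge is cut
```

Minimizing this quantity over partitions favors cutting through sparsely connected regions of the graph, which is the intuition behind graph-cut based segmentation.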
Mercer kernel-based methods [17], [18], [19], [20] have dominated machine learning and pattern recognition since the introduction of the support vector machine [21], [22], [23], [24]. Here, the main idea is to implicitly map the data points into a potentially infinite-dimensional non-linear feature space using a Mercer kernel. In the Mercer kernel feature space, the data are more likely to be linearly separable, and linear machine learning techniques may be used.
Quite recently, yet another machine learning field has received significant attention, namely the spectral methods [17]. Spectral methods refer to techniques that use the eigenvalues (spectrum) and eigenvectors of certain data matrices to perform machine learning tasks. See for example [25], [26], [27], [28], [29]. Kernel PCA [30] is one prominent example of a spectral method. It basically performs a principal component analysis (PCA) approximation to Mercer kernel feature spaces.
The remainder of the paper is organized as follows. In Section 2 the Parzen window technique for density estimation is discussed. In Section 3, it is shown how the CS divergence may be estimated non-parametrically using the Parzen window method. Furthermore, in Section 4, the connection to graph theory is discussed, and in Section 5 the connection to Mercer kernel and spectral methods is discussed. Section 6 discusses an extension of the CS divergence. We make our concluding remarks in Section 7.
Parzen windowing
For the convenience of the reader unfamiliar with non-parametric density estimation, we will in this section review the technique known as Parzen windowing, or kernel density estimation, which is central to estimating the CS divergence. Our review follows those given in [31], [32], [33]. We also provide a brief and somewhat limited discussion of the issue of window size selection.
There are two approaches for estimating the pdf of a random variable from its independent and identically distributed samples.
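The non-parametric Parzen window estimator places a kernel (window) function on every sample and averages: p̂(x) = (1/N) Σᵢ K_σ(x − xᵢ). The sketch below uses a Gaussian window; the window size σ = 0.3 is an illustrative assumption, since selecting it is itself a research topic.

```python
import numpy as np

def parzen_pdf(x, samples, sigma):
    """Parzen window (kernel density) estimate with a Gaussian window:
    p_hat(x) = (1/N) * sum_i K_sigma(x - x_i)."""
    samples = np.asarray(samples)
    diffs = x[:, None] - samples[None, :]                      # (len(x), N)
    windows = np.exp(-diffs ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return windows.mean(axis=1)

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=2000)

x = np.linspace(-5.0, 5.0, 1001)
p_hat = parzen_pdf(x, samples, sigma=0.3)

# The estimate is a valid density: it integrates to ~1 and peaks near the mean.
total = p_hat.sum() * (x[1] - x[0])
print(total)
```

Because each window integrates to one, the estimator is automatically a valid pdf for any sample size, which is one reason it combines so cleanly with the quadratic quantities used in ITL.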
Cauchy–Schwarz divergence
Measures of how close two pdfs p(x) and q(x) are in some specific sense, are provided by the information theoretic divergences, such as the Kullback–Leibler divergence [13] or the Chernoff divergences [41]. In this paper, we focus on the so-called CS divergence.
Define the inner product between two square-integrable functions p(x) and q(x) as ⟨p, q⟩ = ∫ p(x)q(x) dx. Then, by the CS inequality, ⟨p, q⟩² ≤ ⟨p, p⟩⟨q, q⟩, with equality if and only if the two functions are linearly dependent.
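The inequality is easy to verify numerically for discretized functions. The sketch below (the test functions and grid are arbitrary illustrative choices) checks both the inequality and its equality condition for linearly dependent functions:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 1000)
dx = x[1] - x[0]

def inner(f, g):
    """Riemann-sum approximation of the L2 inner product <f, g> on [0, 1]."""
    return np.sum(f * g) * dx

# Two arbitrary square-integrable functions.
f = np.sin(2 * np.pi * x) + 0.5
g = rng.standard_normal(x.shape)

lhs = inner(f, g) ** 2
rhs = inner(f, f) * inner(g, g)
print(lhs <= rhs)  # True: <f, g>^2 <= <f, f><g, g>

# Equality holds when the functions are linearly dependent.
h = 3.0 * f
print(np.isclose(inner(f, h) ** 2, inner(f, f) * inner(h, h)))  # True
```

Applying the inequality to the pdfs p and q is what guarantees the CS divergence is non-negative: the ratio inside the logarithm is a cosine-like quantity bounded by 1.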
Relation to graph theory
In this section we will introduce the graph cut. The graph cut has been an important cost function used for example in image segmentation [16]. Thereafter, we will show that the CS divergence is actually closely related to the graph cut.
Relation to Mercer kernel theory
In this section, we will explain the idea behind Mercer kernel-based machine learning algorithms. We will then show that the CS divergence can be considered a distance measure in such a Mercer kernel feature space. We will discuss how the Mercer kernel feature space can be approximated using spectral techniques, and show that the distance measure represented by the CS divergence makes sense in such a feature space.
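A key observation of this kind is that the Parzen-based CS divergence estimate can be written entirely in terms of kernel evaluations between samples, i.e., as −log of the cosine of the angle between the mean vectors of the two mapped sample sets in the Mercer kernel feature space. The following is a hedged sketch of that estimator with a Gaussian kernel; the data and kernel width are illustrative assumptions.

```python
import numpy as np

def gaussian_gram(X, Y, sigma):
    """Gram matrix of the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = np.sum(X ** 2, 1)[:, None] + np.sum(Y ** 2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma ** 2))

def cs_divergence_kernel(X1, X2, sigma):
    """CS divergence estimate from samples: -log of the cosine of the angle
    between the feature-space mean vectors of the two sample sets."""
    k12 = gaussian_gram(X1, X2, sigma).mean()
    k11 = gaussian_gram(X1, X1, sigma).mean()
    k22 = gaussian_gram(X2, X2, sigma).mean()
    return -np.log(k12 / np.sqrt(k11 * k22))

rng = np.random.default_rng(3)
X1 = rng.normal(0.0, 1.0, (500, 1))
X2 = rng.normal(4.0, 1.0, (500, 1))

print(cs_divergence_kernel(X1, X1, sigma=1.0))  # 0: zero angle with itself
print(cs_divergence_kernel(X1, X2, sigma=1.0))  # positive for separated samples
```

Note that no densities are ever evaluated explicitly: the estimator works directly on pairwise kernel values, which is precisely the link between the Parzen-based information theoretic quantity and Mercer kernel methods.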
Extension to the multi-PDF case
The CS divergence may also be extended, such that it measures the overall “distance” between several pdfs p_1(x), …, p_k(x) at the same time, as follows:

D_CS(p_1, …, p_k) = −log [ (1/κ) Σ_{i=1}^{k−1} Σ_{j=i+1}^{k} ∫ p_i(x)p_j(x) dx / √( ∫ p_i²(x) dx ∫ p_j²(x) dx ) ],

where κ = k(k−1)/2 and k ≥ 2. Note that D_CS(p_1, …, p_k) = 0 only for p_1 = ⋯ = p_k. When replacing the actual pdfs by their Parzen window estimators, it can easily be shown that the resulting quantity consists solely of pairwise kernel evaluations between the samples. Hence, the Parzen window-based estimator for the multi-pdf CS divergence can be computed directly from the data.
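A sketch of such a multi-pdf estimator is given below, averaging the pairwise cosine terms over all pairs of sample sets. The pairwise-averaged form and all numerical choices (data, kernel width) are illustrative assumptions; the paper's exact normalization may differ.

```python
import numpy as np

def gaussian_gram(X, Y, sigma):
    """Gram matrix of the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = np.sum(X ** 2, 1)[:, None] + np.sum(Y ** 2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma ** 2))

def multi_cs_divergence(sample_sets, sigma):
    """Pairwise-averaged CS divergence over k >= 2 sample sets:
    -log of the mean cosine term over all pairs (i, j)."""
    k = len(sample_sets)
    cosines = []
    for i in range(k - 1):
        for j in range(i + 1, k):
            kij = gaussian_gram(sample_sets[i], sample_sets[j], sigma).mean()
            kii = gaussian_gram(sample_sets[i], sample_sets[i], sigma).mean()
            kjj = gaussian_gram(sample_sets[j], sample_sets[j], sigma).mean()
            cosines.append(kij / np.sqrt(kii * kjj))
    return -np.log(np.mean(cosines))

rng = np.random.default_rng(4)
separated = [rng.normal(mu, 1.0, (300, 1)) for mu in (0.0, 4.0, 8.0)]
identical = [rng.normal(0.0, 1.0, (300, 1)) for _ in range(3)]

print(multi_cs_divergence(identical, sigma=1.0))  # near 0 for matching distributions
print(multi_cs_divergence(separated, sigma=1.0))  # large for well-separated ones
```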
Discussion
In this paper, some recent connections between the information theoretic CS pdf divergence measure, graph theory and Mercer kernel and spectral theory have been presented. These connections are revealed when the CS divergence is estimated using the Parzen window technique for pdf estimation. Thus, these connections have the important consequence that they enhance our understanding of these seemingly different machine learning schemes relative to each other, since they have been shown to be closely related.
References (56)
- et al., A mutual information extension to the matched filter, Signal Process., 2005.
- et al., Blind source separation using Renyi's α-marginal entropies, Neurocomputing, 2002.
- et al., Information theoretic learning.
- A mathematical theory of communication, Bell Syst. Tech. J., 1948.
- A. Renyi, On measures of entropy and information, Selected Papers of Alfred Renyi, vol. 2, Akademiai Kiado, Budapest, ...
- et al., Stochastic blind equalization based on PDF fitting using Parzen estimator, IEEE Trans. Signal Process., 2005.
- et al., Minimax mutual information approach for independent component analysis, Neural Comput., 2004.
- et al., Convergence properties and data efficiency of the minimum error-entropy criterion in adaline training, IEEE Trans. Signal Process., 2003.
- et al., Entropy minimization for supervised digital communications channel equalization, IEEE Trans. Signal Process., 2002.
- et al., Generalized information potential criterion for adaptive system training, IEEE Trans. Neural Networks, 2002.
- An error-entropy minimization algorithm for supervised training of nonlinear adaptive systems, IEEE Trans. Signal Process.
- Learning from examples with information theoretic criteria, J. VLSI Signal Process.
- On information and sufficiency, Ann. Math. Stat.
- Spectral Graph Theory.
- An optimal graph theoretic approach to data clustering: theory and its applications to image segmentation, IEEE Trans. Pattern Anal. Mach. Intell.
- Kernel Methods for Pattern Analysis.
- An introduction to kernel-based learning algorithms, IEEE Trans. Neural Networks.
- Learning with Kernels.
- Support vector networks, Mach. Learn.
- The Nature of Statistical Learning Theory.
- An Introduction to Support Vector Machines.
- A tutorial on support vector machines for pattern recognition, Knowl. Discovery Data Min.
- Nonlinear dimensionality reduction by locally linear embedding, Science.
- A global geometric framework for nonlinear dimensionality reduction, Science.
- Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput.
- On spectral clustering: analysis and an algorithm.