The Cauchy–Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels

https://doi.org/10.1016/j.jfranklin.2006.03.018

Abstract

This paper contributes a tutorial level discussion of some interesting properties of the recent Cauchy–Schwarz (CS) divergence measure between probability density functions. This measure brings together elements from several different machine learning fields, namely information theory, graph theory and Mercer kernel and spectral theory. These connections are revealed when estimating the CS divergence non-parametrically using the Parzen window technique for density estimation. An important consequence of these connections is that they enhance our understanding of the different machine learning schemes relative to each other.

Introduction

Recently, a new scheme for statistically based machine learning has emerged, coined information theoretic learning (ITL) [1]. The starting point is a data set that globally conveys information about a real-world event. The goal in ITL is to capture, or learn, this information in the form of the parameters of an adaptive system. This is done using information theoretic cost functions as learning criteria. As opposed to the traditional mean squared error criterion, information theoretic cost functions take into account statistical dependencies beyond correlations. This is important in many problems in machine learning, such as blind source separation and independent component analysis, blind equalization and deconvolution, subspace projections, dimensionality reduction and manifold learning, feature extraction, classification and clustering.

Shannon's entropy [2] is the most well-known information theoretic quantity. Renyi [3] proposed another important entropy measure, which satisfies all but one of the axioms laid down by Shannon for an information measure. It contains the Shannon entropy as a special case. We will therefore refer to Renyi's entropy as an information theoretic measure.

These information theoretic cost functions are expressed in terms of probability density functions (pdfs). Hence, ITL implicitly requires pdfs to be estimated in order to evaluate the information theoretic criterion. Since it is oftentimes desirable to make as few assumptions as possible about the structure of the pdfs in question, Principe et al. [1] argued that Parzen windowing is a convenient density estimation technique. In that paper, it was also proposed to use the Renyi quadratic entropy as an information theoretic cost function, because it can be estimated using Parzen windowing without any approximations besides the Parzen windowing itself. This is not the case for the Shannon entropy. ITL based on the Renyi quadratic entropy and Parzen windowing has been applied with great success to several supervised and unsupervised learning schemes, see for example [4], [5], [6], [7], [8], [9], [10], [11], [12].
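
To make this concrete, below is a minimal sketch of the Parzen-window plug-in estimator of the Renyi quadratic entropy, $H_2(p) = -\log \int p^2(x)\,dx$, on which this line of work relies. The code is our own illustration, not code from the paper; it assumes a Gaussian window of width sigma and uses the fact that the integral of the product of two Gaussian windows is again a Gaussian, of width sigma·√2.

```python
import numpy as np

def gaussian(diff_sq, sigma, d):
    """Isotropic d-dimensional Gaussian evaluated at squared distances diff_sq."""
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.exp(-diff_sq / (2.0 * sigma ** 2)) / norm

def renyi_quadratic_entropy(x, sigma=1.0):
    """Parzen-window plug-in estimate of H2 = -log integral p^2(x) dx.
    The integral of the product of two Gaussian windows of width sigma is a
    Gaussian of width sigma*sqrt(2), so the estimate is a double sum over pairs."""
    x = np.atleast_2d(x)                                  # shape (N, d)
    n, d = x.shape
    diff_sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    information_potential = gaussian(diff_sq, np.sqrt(2.0) * sigma, d).mean()
    return -np.log(information_potential)

# Example: entropy estimate for a one-dimensional Gaussian sample
rng = np.random.default_rng(0)
print(renyi_quadratic_entropy(rng.normal(size=(500, 1)), sigma=0.3))
```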

Other cost functions based on the Renyi entropy were also proposed, for the same reasons. One important such quantity is the so-called Cauchy–Schwarz (CS) pdf divergence. The CS divergence is a measure of the "distance" between two pdfs, p1(x) and p2(x). It can be considered an approximation to the well-known Kullback–Leibler divergence [13]. The CS divergence satisfies, for example, the additivity property of an information measure [14], considered by Renyi to be the most important such property. It is therefore referred to as an information theoretic measure. The CS divergence measure is given by
$$D_{CS}(p_1,p_2) = -\log \frac{\int p_1(x)\,p_2(x)\,dx}{\sqrt{\int p_1^2(x)\,dx \int p_2^2(x)\,dx}}.$$
This is a symmetric measure, such that $0 \le D_{CS} < \infty$, where the minimum is obtained if and only if $p_1(x) = p_2(x)$.
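
As a quick numerical illustration of this definition (the grid, bandwidths and function names below are our own choices), the divergence can be evaluated for densities tabulated on a common grid; it is symmetric, vanishes when the two densities coincide, and grows as they separate.

```python
import numpy as np

def cs_divergence(p1, p2, dx):
    """Cauchy-Schwarz divergence between two densities tabulated on a common
    grid with spacing dx (integrals approximated by the rectangle rule)."""
    cross = np.sum(p1 * p2) * dx
    norm1 = np.sum(p1 ** 2) * dx
    norm2 = np.sum(p2 ** 2) * dx
    return -np.log(cross / np.sqrt(norm1 * norm2))

# Two unit-variance Gaussians: D_CS = 0 only when the densities coincide
x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]
gauss = lambda mu: np.exp(-(x - mu) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
print(cs_divergence(gauss(0.0), gauss(0.0), dx))   # approximately 0
print(cs_divergence(gauss(0.0), gauss(2.0), dx))   # strictly positive
```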

This paper provides a tutorial level discussion of some interesting properties of the CS divergence. It turns out that when Parzen windowing is used to estimate the CS divergence, the resulting cost function can be interpreted both in terms of graph theory and Mercer kernel and spectral theory. Fig. 1 illustrates these connections, which we will discuss further in the following sections. Hence, these links indicate that graph theoretic measures and Mercer kernel-based measures are closely related to certain information theoretic measures and non-parametric density estimation. Graph theory and Mercer kernel-based theory have been important parts of machine learning research in recent years.

Graph theory [15] has been used for decades in various scientific fields for many purposes. In the last decade, it has also been introduced to the fields of computer vision and machine learning, for example through optimization of the so-called graph cut [16]. The graph cut provides a measure of the cost of partitioning a graph into two subgraphs.

Mercer kernel-based methods [17], [18], [19], [20] have dominated machine learning and pattern recognition since the introduction of the support vector machine [21], [22], [23], [24]. Here, the main idea is to implicitly map the data points into a potentially infinite-dimensional non-linear feature space using Mercer kernels. In the Mercer kernel feature space, the data are more likely to be linearly separable, and linear machine learning techniques may be used.

Quite recently, yet another machine learning field has received significant attention, namely spectral methods [17]. Spectral methods refer to techniques that use the eigenvalues (spectrum) and eigenvectors of certain data matrices to perform machine learning tasks. See for example [25], [26], [27], [28], [29]. Kernel PCA [30] is one prominent example of a spectral method. It essentially performs a principal component analysis (PCA) approximation to Mercer kernel feature spaces.
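
As an illustration of the spectral viewpoint, the following is a bare-bones kernel PCA sketch in the usual formulation (Gaussian kernel, centered kernel matrix, leading eigenvectors). It is our own illustrative code, not an implementation from [30]; the projections of the training points onto the i-th kernel principal axis are sqrt(lambda_i) times the i-th eigenvector of the centered kernel matrix.

```python
import numpy as np

def kernel_pca(x, sigma=1.0, n_components=2):
    """PCA approximation to the Mercer kernel feature space: eigendecompose the
    centered Gaussian kernel matrix and return the leading projections."""
    diff_sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    k = np.exp(-diff_sq / (2.0 * sigma ** 2))
    n = k.shape[0]
    one_n = np.ones((n, n)) / n
    kc = k - one_n @ k - k @ one_n + one_n @ k @ one_n    # centering in feature space
    vals, vecs = np.linalg.eigh(kc)                       # ascending eigenvalues
    order = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[order], vecs[:, order]
    return vecs * np.sqrt(np.maximum(vals, 0.0))          # sqrt(lambda_i) * v_i

rng = np.random.default_rng(0)
projections = kernel_pca(rng.normal(size=(100, 3)), sigma=1.5)
print(projections.shape)   # (100, 2)
```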

The remainder of the paper is organized as follows. In Section 2, the Parzen window technique for density estimation is discussed. In Section 3, it is shown how the CS divergence may be estimated non-parametrically using the Parzen window method. Furthermore, in Section 4, the connection to graph theory is discussed, and in Section 5 the connection to Mercer kernel and spectral methods is discussed. Section 6 discusses an extension of the CS divergence. We make our concluding remarks in Section 7.

Section snippets

Parzen windowing

For the convenience of the reader unfamiliar with non-parametric density estimation, we will in this section review the technique known as Parzen windowing, or kernel density estimation, which is central to estimating the CS divergence. Our review follows those given in [31], [32], [33]. In a subsection, we provide a brief and somewhat limited discussion of the issue of window size selection.

There are two approaches for estimating the pdf of a random variable from its independent and
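
For readers who wish to experiment, a minimal Gaussian Parzen window estimator might look as follows; the function names and the isotropic Gaussian window of width sigma are our own illustrative choices, not code from the paper.

```python
import numpy as np

def parzen_estimate(x_eval, samples, sigma=1.0):
    """Parzen (kernel density) estimate: the average of an isotropic Gaussian
    window of width sigma centered on every sample, evaluated at x_eval."""
    x_eval = np.atleast_2d(x_eval)        # shape (M, d)
    samples = np.atleast_2d(samples)      # shape (N, d)
    d = samples.shape[1]
    diff_sq = ((x_eval[:, None, :] - samples[None, :, :]) ** 2).sum(-1)
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.exp(-diff_sq / (2.0 * sigma ** 2)).mean(axis=1) / norm

rng = np.random.default_rng(0)
samples = rng.normal(size=(300, 1))
grid = np.linspace(-4.0, 4.0, 9).reshape(-1, 1)
print(parzen_estimate(grid, samples, sigma=0.4))
```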

Cauchy–Schwarz divergence

Measures of how close two pdfs p1(x) and p2(x) are, in some specific sense, are provided by the information theoretic divergences, such as the Kullback–Leibler divergence [13] or the Chernoff divergence [41]. In this paper, we focus on the so-called CS divergence.

Define the inner product between two square-integrable functions h(x) and g(x) as $\langle h, g \rangle = \int h(x)\,g(x)\,dx$. Then, by the CS inequality,
$$\left( \int h(x)\,g(x)\,dx \right)^2 \le \int |h(x)|^2\,dx \int |g(x)|^2\,dx,$$
with equality if and only if the two functions are linearly dependent. Now
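
In practice, the CS divergence is estimated by replacing p1(x) and p2(x) with their Gaussian Parzen window estimates, whereupon each integral reduces to a double sum of Gaussians of width sigma·√2 over pairs of samples. The sketch below is our own illustration of such a plug-in estimator (names and bandwidth choice are ours):

```python
import numpy as np

def _cross_information_potential(a, b, sigma):
    """Integral of the product of two Gaussian Parzen estimates (window width
    sigma), which reduces to pairwise Gaussians of width sigma*sqrt(2)."""
    d = a.shape[1]
    diff_sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    norm = (4.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.exp(-diff_sq / (4.0 * sigma ** 2)).mean() / norm

def cs_divergence_parzen(x1, x2, sigma=1.0):
    """Plug-in CS divergence estimate obtained by replacing p1 and p2 with
    their Gaussian Parzen window estimates."""
    x1, x2 = np.atleast_2d(x1), np.atleast_2d(x2)
    v12 = _cross_information_potential(x1, x2, sigma)
    v11 = _cross_information_potential(x1, x1, sigma)
    v22 = _cross_information_potential(x2, x2, sigma)
    return -np.log(v12 / np.sqrt(v11 * v22))

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, (200, 1))
x2 = rng.normal(2.0, 1.0, (200, 1))
print(cs_divergence_parzen(x1, x2, sigma=0.5))
```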

Relation to graph theory

In this section, we will introduce the graph cut, which has been an important cost function used, for example, in image segmentation [16]. Thereafter, we will show that the CS divergence is in fact closely related to the graph cut.
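
For concreteness, the cut between two subgraphs is the sum of the edge weights (affinities) connecting nodes on opposite sides of the partition. The following toy sketch is our own illustration, using a Gaussian affinity matrix over a small two-cluster data set:

```python
import numpy as np

def graph_cut(affinity, labels):
    """Cut of a two-way partition: the sum of affinities over all edges that
    connect a node labelled True to a node labelled False."""
    labels = np.asarray(labels, dtype=bool)
    return affinity[np.ix_(labels, ~labels)].sum()

# Toy data: two well-separated clusters and a Gaussian affinity (kernel) matrix
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.3, (5, 2)), rng.normal(3.0, 0.3, (5, 2))])
diff_sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
affinity = np.exp(-diff_sq / 2.0)
labels = [True] * 5 + [False] * 5
print(graph_cut(affinity, labels))   # small cut: the split follows the clusters
```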

Relation to Mercer kernel theory

In this section, we will explain the idea behind Mercer kernel-based machine learning algorithms. We will then show that the CS divergence can be considered a distance measure in such a Mercer kernel feature space. We will discuss how the Mercer kernel feature space can be approximated using spectral techniques, and show that the distance measure represented by the CS divergence makes sense in such a feature space.
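
As a sketch of this feature-space interpretation (our own code, assuming a Gaussian Mercer kernel), the Parzen-based CS divergence can be read as minus the logarithm of the cosine of the angle between the mean vectors of the two mapped sample sets, with all inner products evaluated through the kernel trick:

```python
import numpy as np

def gauss_gram(a, b, sigma):
    """Gaussian (Mercer) kernel matrix between two sample sets."""
    diff_sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-diff_sq / (2.0 * sigma ** 2))

def cs_divergence_feature_space(x1, x2, sigma=1.0):
    """-log of the cosine of the angle between the feature-space mean vectors
    m1 and m2; each inner product <m_i, m_j> is the mean of a kernel matrix."""
    m1m2 = gauss_gram(x1, x2, sigma).mean()
    m1m1 = gauss_gram(x1, x1, sigma).mean()
    m2m2 = gauss_gram(x2, x2, sigma).mean()
    return -np.log(m1m2 / np.sqrt(m1m1 * m2m2))

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, (150, 2))
x2 = rng.normal(1.5, 1.0, (150, 2))
print(cs_divergence_feature_space(x1, x2, sigma=1.0))
```

Up to the identification of the kernel size with the Parzen window width, this coincides with the Parzen plug-in estimator sketched earlier, which is precisely the kind of connection discussed in this section.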

Extension to the multi-PDF case

The CS divergence may also be extended, such that it measures the overall "distance" between several pdfs at the same time, as follows:
$$D_{CS}(p_1,\ldots,p_C) = -\log \frac{1}{\kappa} \sum_{i=1}^{C-1} \sum_{j>i} \frac{\langle p_i, p_j \rangle}{\sqrt{\langle p_i, p_i \rangle \, \langle p_j, p_j \rangle}},$$
where $\kappa = \sum_{c=1}^{C-1} c$ and $0 \le D_{CS}(p_1,\ldots,p_C) < \infty$. Note that $D_{CS} = 0$ only for $p_1(x) = \cdots = p_C(x)$. When replacing the actual pdfs by their Parzen window estimators, it can easily be shown that
$$\hat{D}_{CS}(p_1,\ldots,p_C) = -\log \frac{1}{\kappa} \sum_{i=1}^{C-1} \sum_{j>i} IC(G_i, G_j) = -\log \frac{1}{\kappa} \sum_{i=1}^{C-1} \sum_{j>i} \cos \angle(\mathbf{m}_i, \mathbf{m}_j).$$
Hence, the Parzen window-based estimator for the multi-pdf CS
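
A sketch of the corresponding multi-pdf estimator (our own illustrative code, Gaussian kernel assumed) averages the pairwise feature-space cosines cos∠(m_i, m_j) over all κ = C(C−1)/2 pairs of sample groups and takes the negative logarithm:

```python
import numpy as np

def multi_cs_divergence(groups, sigma=1.0):
    """Multi-pdf CS divergence estimate: -log of the average pairwise cosine
    between the feature-space mean vectors of C sample groups."""
    def gram(a, b):
        diff_sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-diff_sq / (2.0 * sigma ** 2))
    c = len(groups)
    norms = [np.sqrt(gram(g, g).mean()) for g in groups]
    cosines = [gram(groups[i], groups[j]).mean() / (norms[i] * norms[j])
               for i in range(c - 1) for j in range(i + 1, c)]
    return -np.log(np.mean(cosines))       # mean over kappa = C(C-1)/2 pairs

rng = np.random.default_rng(0)
groups = [rng.normal(mu, 1.0, (100, 2)) for mu in (0.0, 2.0, 4.0)]
print(multi_cs_divergence(groups, sigma=1.0))
```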

Discussion

In this paper, some recent connections between the information theoretic CS pdf divergence measure, graph theory and Mercer kernel and spectral theory have been presented. These connections are revealed when the CS divergence is estimated using the Parzen window technique for pdf estimation. Thus, these connections have the important consequence that they enhance our understanding of these seemingly different machine learning schemes relative to each other, since they have been shown to be

References (56)

  • D. Erdogmus et al., A mutual information extension to the matched filter, Signal Process. (2005)
  • D. Erdogmus et al., Blind source separation using Renyi's α-marginal entropies, Neurocomputing (2002)
  • J. Principe et al., Information theoretic learning
  • C.E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J. (1948)
  • A. Renyi, On measures of entropy and information, Selected Papers of Alfred Renyi, vol. 2, Akademiai Kiado, Budapest, ...
  • M. Lazaro et al., Stochastic blind equalization based on PDF fitting using Parzen estimator, IEEE Trans. Signal Process. (2005)
  • D. Erdogmus et al., Minimax mutual information approach for independent component analysis, Neural Comput. (2004)
  • D. Erdogmus et al., Convergence properties and data efficiency of the minimum error-entropy criterion in adaline training, IEEE Trans. Signal Process. (2003)
  • I. Santamaria et al., Entropy minimization for supervised digital communications channel equalization, IEEE Trans. Signal Process. (2002)
  • D. Erdogmus et al., Generalized information potential criterion for adaptive system training, IEEE Trans. Neural Networks (2002)
  • D. Erdogmus et al., An error-entropy minimization algorithm for supervised training of nonlinear adaptive systems, IEEE Trans. Signal Process. (2002)
  • J.C. Principe et al., Learning from examples with information theoretic criteria, J. VLSI Signal Process. (2000)
  • S. Kullback et al., On information and sufficiency, Ann. Math. Stat. (1951)
  • R. Jenssen, An information theoretic approach to machine learning, Ph.D. Thesis, University of Tromsø, Tromsø, Norway, ...
  • F.R.K. Chung, Spectral Graph Theory (1997)
  • Z. Wu et al., An optimal graph theoretic approach to data clustering: theory and its applications to image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. (1993)
  • J. Shawe-Taylor et al., Kernel Methods for Pattern Analysis (2004)
  • K.R. Müller et al., An introduction to kernel-based learning algorithms, IEEE Trans. Neural Networks (2001)
  • F. Perez-Cruz, O. Bousquet, Kernel methods and their potential use in signal processing, IEEE Signal Processing ...
  • B. Schölkopf et al., Learning with Kernels (2002)
  • C. Cortes et al., Support vector networks, Mach. Learn. (1995)
  • V.N. Vapnik, The Nature of Statistical Learning Theory (1995)
  • N. Cristianini et al., An Introduction to Support Vector Machines (2000)
  • C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Knowl. Discovery Data Min. (1998)
  • S. Roweis et al., Nonlinear dimensionality reduction by locally linear embedding, Science (2000)
  • J. Tenenbaum et al., A global geometric framework for nonlinear dimensionality reduction, Science (2000)
  • M. Belkin et al., Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. (2003)
  • A.Y. Ng et al., On spectral clustering: analysis and an algorithm