On bandwidth selection using minimal spanning tree for kernel density estimation

https://doi.org/10.1016/j.csda.2016.04.005

Abstract

The use of kernel density estimation is well known in a large variety of machine learning applications such as classification, clustering, and feature selection. One of the major issues in the construction of kernel density estimators is the tuning of the bandwidth parameter. Most bandwidth selection procedures optimize the mean integrated squared or absolute error, which requires huge computational time as the size of the data increases. Here, the bandwidth is taken to be a function of the inter-point distances of the data set; specifically, it is defined as a function of the length of the Euclidean Minimal Spanning Tree (EMST) of the given sample points. No rigorous theory about the asymptotic properties of the EMST based density estimator has been developed in the literature. A theoretical analysis of the asymptotic properties of the EMST based density estimator is established here, and it is proved that the estimator is asymptotically unbiased for the original density at every continuity point. Moreover, the theoretical analysis is provided for a general kernel. Experiments are conducted using both synthetic and real-life data sets to compare the performance of the EMST bandwidth with those of conventional cross-validation and plug-in bandwidth selectors. The EMST based estimator is found to achieve comparable performance while being simpler and faster than the conventional estimators.

Introduction

Many tasks in pattern recognition and machine learning require knowledge of the underlying densities of the observed data (Menardi and Azzalini, 2014, Brown et al., 2012, Stover and Ulm, 2013, Brox et al., 2007, Ji et al., 2014, Jones and Rehg, 2002, Liu et al., 2007). For example, in Bayes classification the decision rule involves estimating the class conditional probabilities of the training data (Duda et al., 1999, Ramoni and Sebastiani, 2001, Kim and Scott, 2010), and in model-based clustering every cluster corresponds to a ‘mode’ or ‘peak’ in the estimated probability density of a given set of points (Li et al., 2007, Hinneburg and Gabriel, 2007, Tang et al., 2015). Estimation of the density can be done either in a parametric or a non-parametric way: in parametric estimation, assumptions are made about the structure of the density, whereas in non-parametric estimation no assumptions are made about the form of the density function. Various methods have been studied for non-parametric density estimation, such as the histogram, kernel density estimators, spline estimators and orthogonal series estimators (Silverman, 1986, Scott, 2009, Golyandina et al., 2012). The kernel method is perhaps the most popular and well-known technique of non-parametric estimation (Parzen, 1962, Cacoullos, 1966).

Throughout this article, we use the following notation. Let $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_n \in \mathbb{R}^d$, $d \ge 2$, denote $n$ independent and identically distributed random vectors, with $\bar{X}_i = (X_{i1}, \ldots, X_{id})'$, where $(\cdot)'$ represents the transpose. A general vector $\bar{x}$ has the representation $\bar{x} = (x_1, \ldots, x_d)'$, and $E(\cdot)$ denotes the expectation of a random vector. Also, $\mathbb{R}^d$ will be shorthand for $\mathbb{R} \times \cdots \times \mathbb{R}$, $d\bar{x}$ will be shorthand for $dx_1 \cdots dx_d$, and $\int (\cdot)\, d\bar{x}$ denotes $\int \cdots \int (\cdot)\, dx_1 \cdots dx_d$.

A general $d$-dimensional kernel density estimator $\hat{f}$, for a random sample $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_n$ with common probability density function $f$, is
$$\hat{f}_n(\bar{x}) = \frac{1}{n} \sum_{i=1}^{n} K_H(\bar{x} - \bar{X}_i),$$
where $K_H$ is the scaled kernel function, i.e., $K_H(\bar{x}) = |H|^{-1/2} K(H^{-1/2} \bar{x})$, $K$ is a $d$-variate kernel function, $H$ is a symmetric positive definite $d \times d$ bandwidth matrix and $|\cdot|$ is the determinant. Traditionally, $K$ is assumed to be symmetric and to satisfy $\int K(\bar{x})\, d\bar{x} = 1$. Some commonly used kernel functions are the uniform, triangle, Epanechnikov and Gaussian; the most widely used is the Gaussian with zero mean and unit variance. From the above equation for $\hat{f}_n(\bar{x})$, it is clear that the kernel density estimate at any test point $\bar{x}$ is simply the sum of the kernel values contributed by all training points $\bar{X}_i$. It is well known that bandwidth selection is the most important and crucial step in obtaining a good estimate. There are mainly two computational challenges associated with KDE: one is the selection of the bandwidth, which is estimated using the training data, and the other is the construction of the density at any test point. Note that bandwidth selection is the only problem considered in this article.
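To make the estimator concrete, the following is a minimal NumPy sketch (not from the paper; the function name and the choice of a Gaussian kernel are ours) that evaluates $\hat{f}_n$ at a single test point for a general bandwidth matrix $H$:

```python
import numpy as np

def kde_at_point(x, X, H):
    """Evaluate f_hat_n(x) = (1/n) * sum_i |H|^{-1/2} K(H^{-1/2}(x - X_i))
    with a zero-mean, unit-variance Gaussian kernel K.

    x : (d,) test point;  X : (n, d) sample;  H : (d, d) symmetric
    positive definite bandwidth matrix.
    """
    n, d = X.shape
    L = np.linalg.cholesky(H)                 # H = L L'
    U = np.linalg.solve(L, (x - X).T).T       # rows are L^{-1}(x - X_i)
    # For the radially symmetric Gaussian kernel only the quadratic form
    # u'u = (x - X_i)' H^{-1} (x - X_i) matters, so the Cholesky factor
    # serves as the matrix square root H^{1/2}.
    quad = np.sum(U * U, axis=1)
    K = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * quad)
    return K.sum() / (n * np.sqrt(np.linalg.det(H)))

# Example: a bivariate standard normal sample, evaluated at the origin
# (the true density there is 1/(2*pi) ~ 0.159).
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))
print(kde_at_point(np.zeros(2), X, H=0.2 * np.eye(2)))
```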

The bandwidth matrix $H$ can be taken to be a diagonal positive definite matrix, i.e., $H = \mathrm{diag}(h_1^2, \ldots, h_d^2)$, $h_i > 0\ \forall i$, to simplify the above equation. Further simplification is obtained from the restriction $h_i = h\ (>0)\ \forall i$, i.e., $H = \mathrm{diag}(h^2, \ldots, h^2)$, which leads to the single-bandwidth kernel density estimator
$$\hat{f}_n(\bar{x}) = \frac{1}{n} \sum_{i=1}^{n} K_h(\bar{x} - \bar{X}_i) = \frac{1}{nh^d} \sum_{i=1}^{n} K\big(h^{-1}(\bar{x} - \bar{X}_i)\big).$$
A full bandwidth matrix provides more flexibility, but it also introduces more complexity into the estimator since more parameters need to be tuned (Wand and Jones, 1994). Although the selection of the bandwidth can be done subjectively, there is a great demand for automatic selection. Several automatic procedures compute an optimal bandwidth value by minimizing the discrepancy between the estimate and the target density under some error criterion. A few such error criteria are given below (Wand and Jones, 1994); a small numerical illustration follows the list.

  • Mean Squared Error (MSE): $\mathrm{MSE}(\hat{f}_n(\bar{x})) = E\{(\hat{f}_n(\bar{x}) - f(\bar{x}))^2\}$.

  • Mean Integrated Squared Error (MISE): $\mathrm{MISE}(\hat{f}_n) = E\{\int (\hat{f}_n(\bar{x}) - f(\bar{x}))^2\, d\bar{x}\}$.

  • Mean Integrated Absolute Error (MIAE): $\mathrm{MIAE}(\hat{f}_n) = E\{\int |\hat{f}_n(\bar{x}) - f(\bar{x})|\, d\bar{x}\}$.
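In simulation studies these criteria are approximated numerically. As an illustration (our own construction, not the paper's code), the sketch below estimates the MISE of a univariate Gaussian-kernel estimator against a known standard normal target by averaging the integrated squared error (ISE) over Monte Carlo replicates:

```python
import numpy as np

def ise_gaussian_target(h, n, grid, rng):
    """One Monte Carlo replicate of the integrated squared error (ISE)
    for a standard normal target, estimated with a Gaussian kernel of
    bandwidth h; MISE is the expectation of this quantity."""
    X = rng.standard_normal(n)
    u = (grid[:, None] - X[None, :]) / h
    f_hat = np.exp(-0.5 * u**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    f_true = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)
    return np.trapz((f_hat - f_true) ** 2, grid)   # numerical integral of (f_hat - f)^2

rng = np.random.default_rng(0)
grid = np.linspace(-5.0, 5.0, 1001)
mise_estimate = np.mean([ise_gaussian_target(h=0.4, n=200, grid=grid, rng=rng)
                         for _ in range(100)])     # average ISE approximates MISE
```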

Most of the modern bandwidth selectors are motivated by minimizing the MSE or MISE, because these two criteria can be decomposed into variance and bias terms, as
$$\mathrm{MSE}(\hat{f}_n(\bar{x})) = (E[\hat{f}_n(\bar{x})] - f(\bar{x}))^2 + \mathrm{Var}\,\hat{f}_n(\bar{x}) = (\mathrm{Bias}\,\hat{f}_n(\bar{x}))^2 + \mathrm{Var}\,\hat{f}_n(\bar{x}),$$
$$\mathrm{MISE}(\hat{f}_n) = E\Big[\int (\hat{f}_n(\bar{x}) - f(\bar{x}))^2\, d\bar{x}\Big] = \int E[(\hat{f}_n(\bar{x}) - f(\bar{x}))^2]\, d\bar{x} = \int \mathrm{MSE}(\hat{f}_n(\bar{x}))\, d\bar{x}.$$
The optimal bandwidth that minimizes MISE, when the underlying density is a $d$-variate normal and a diagonal bandwidth matrix is employed, can be approximated by (Silverman, 1986)
$$h_i = \sigma_i \Big\{\frac{4}{(d+2)n}\Big\}^{1/(d+4)}, \quad i = 1, 2, \ldots, d,$$
where $\sigma_i$ is the standard deviation of the $i$th variable. This is called the "Normal Reference (NR) rule".

The most studied automatic bandwidth selector aiming to minimize MISE is Least Squares Cross Validation (LSCV) (Bowman, 1984). The LSCV selector of $H$ is $\hat{H}_{\mathrm{LSCV}} = \arg\min_H \mathrm{LSCV}(H)$, where
$$\mathrm{LSCV}(H) = \int \hat{f}(\bar{x}; H)^2\, d\bar{x} - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{-i}(\bar{X}_i; H),$$
and $\hat{f}_{-i}(\cdot\,; H)$ is the kernel estimator based on the sample without $\bar{X}_i$. One can minimize $\mathrm{LSCV}(H)$ over all positive definite diagonal matrices. Since the cross-validation criterion involves numerical optimization, it becomes increasingly difficult and time consuming as the data size increases, and it has been shown that $\hat{H}_{\mathrm{LSCV}}$ is highly variable (Wand and Jones, 1994). A number of modifications of LSCV have been proposed in order to improve its performance. These include the Biased Cross Validation (BCV) method (Scott and Terrell, 1987), the method in Ahmad and Ran (2004) based on kernel contrasts, likelihood cross validation (Zhang et al., 2006), indirect cross validation (Savchuk et al., 2010), and the do-validation method (Mammen et al., 2012). BCV is a well-known method that aims to minimize the Asymptotic MISE (AMISE) instead of the exact MISE formula. Another popular data-driven bandwidth selector is the Direct Plug-In (DPI) (Sheather and Jones, 1991, Liao et al., 2010). This is also based on AMISE, with estimates of the unknown quantities "plugged in". Plug-in produces more stable bandwidths than cross validation does, and hence is currently the more popular method. Smoothed Cross Validation (SCV) can be thought of as a hybrid of LSCV and BCV (Hall et al., 1992): it is based on the asymptotic integrated variance, but uses the exact integrated squared bias rather than its asymptotic form.

Although the bandwidth is usually taken to be a constant, several methods have been proposed to vary it. Examples are the popular $k$th nearest neighbor estimate (Loftsgaarden and Quesenberry, 1965), the adaptive kernel estimate proposed in Breiman et al. (1977), and the method in Mahapatruni and Gray (2011). In Zougab et al. (2014), a Bayesian estimation of adaptive bandwidth matrices in multivariate kernel density estimation is investigated, when the quadratic and entropy loss functions are used. A number of articles dealing with the problem of bandwidth selection exist in the literature (Terrell and Scott, 1992, Hall, 1992, Heidenreich et al., 2013, Cheng et al., 2014).

The most successful state-of-the-art bandwidth selection methods involve optimizing one of the MSE, MISE and AMISE criterion functions. Evaluation of such criterion functions involves $O(n^2)$ computations, where $n$ is the number of samples (Silverman, 1986). Optimization of these criterion functions then becomes computationally very expensive for large data sets. In the case of LSCV, $\frac{1}{2}n(n-1)$ evaluations of the kernel function $K(\bar{x})$ are needed to compute each value of the criterion function.
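For the single-bandwidth Gaussian-kernel case, the LSCV criterion admits a closed form, since $\int \hat{f}^2\, d\bar{x}$ reduces to a sum of Gaussians with standard deviation $\sqrt{2}h$ over the pairwise differences. A minimal univariate sketch (ours; the optimizer bounds are arbitrary) that also exhibits the $O(n^2)$ cost of each criterion evaluation:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def lscv(h, X):
    """LSCV(h) for a univariate Gaussian-kernel KDE: the first term is
    the exact integral of f_hat^2 (a Gaussian of std h*sqrt(2) at each
    pairwise difference); the second is the leave-one-out average.
    Forming the n x n difference matrix makes each evaluation O(n^2)."""
    n = len(X)
    phi = lambda d, s: np.exp(-0.5 * (d / s) ** 2) / (s * np.sqrt(2 * np.pi))
    D = X[:, None] - X[None, :]
    term1 = phi(D, h * np.sqrt(2.0)).sum() / n**2
    K = phi(D, h)
    loo = (K.sum(axis=1) - phi(0.0, h)) / (n - 1)   # exclude the self-term
    return term1 - 2.0 * loo.mean()

rng = np.random.default_rng(1)
X = rng.standard_normal(500)
res = minimize_scalar(lscv, bounds=(0.05, 2.0), args=(X,), method="bounded")
h_lscv = res.x                                      # LSCV-selected bandwidth
```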
If a Gaussian kernel is considered, a single component value $K(\bar{x})$ requires $O(d^2)$ operations, where $d$ is the dimension of the data (Silverman, 1986), so $O(n^2 d^2)$ computations are required to compute each value of the criterion function. Other methods such as BCV and PI, whose criterion functions depend on the estimation of general integrated squared density derivative functionals, also require $O(n^2 d^2)$ computational cost. To optimize such criterion functions, the most commonly used method is Quasi-Newton, an iterative procedure that approximates the Hessian matrix using recent function and gradient evaluations (Gill et al., 1981). If finite difference gradients are used, each iteration of this technique requires $d$ function evaluations. In addition, the calculation of step directions and Hessian approximations involves $2d^2 + O(d)$ multiplications and the same number of additions and subtractions per iteration (Byrd et al., 1988). Since each function evaluation needs $O(n^2 d^2)$ operations, the entire cost per iteration is $d \times O(n^2 d^2) + 2d^2 + O(d) = O(n^2 d^3)$. If the optimization process takes $n_c$ iterations, the total complexity is then $O(n_c n^2 d^3)$. One can reduce the computational time by binning the samples (Scott and Sheather, 1985, Yang et al., 2003, Raykar et al., 2010); a small sketch of the binning idea follows this paragraph. Another approach is to use a representative condensed set instead of the whole data set for density estimation (Girolami and He, 2003, Deng et al., 2008, Wang et al., 2014). However, much of the information about the position of the samples is lost by these methods. As large data sets are very common in modern machine learning applications, the design of a novel bandwidth selector that can handle large data sets with reduced computational time while providing similar accuracy is very much needed.
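The following sketch (ours, for illustration only) shows the binning idea in the univariate case: the sample is compressed to bin counts on the evaluation grid, and the kernel is applied to the counts rather than to the raw samples, which is how the per-evaluation cost is reduced at the price of losing the exact sample positions:

```python
import numpy as np

def binned_kde(X, h, grid):
    """Approximate a univariate Gaussian-kernel KDE by first binning
    the n samples onto m grid cells, then smoothing the bin counts:
    the kernel sum is evaluated once per (grid point, bin) pair, so the
    cost is O(m^2) instead of O(n*m) for the exact estimator."""
    m = len(grid)
    counts, edges = np.histogram(X, bins=m, range=(grid[0], grid[-1]))
    centers = 0.5 * (edges[:-1] + edges[1:])          # bin representatives
    u = (grid[:, None] - centers[None, :]) / h
    K = np.exp(-0.5 * u**2) / (h * np.sqrt(2 * np.pi))
    return (K @ counts) / len(X)                      # approximate f_hat on the grid

# Example: one million samples compressed to 400 bins.
rng = np.random.default_rng(3)
X = rng.standard_normal(1_000_000)
grid = np.linspace(-4.0, 4.0, 400)
f_hat = binned_kde(X, h=0.1, grid=grid)
```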

Parzen (1962) showed that if a sequence $h = h_n$ of positive numbers satisfies $h_n \to 0$ and $nh_n \to \infty$ as $n \to \infty$, then the resulting kernel density estimator is asymptotically unbiased and consistent for the target density. A generalization of Parzen's work to the multivariate case is presented in Cacoullos (1966). Here, the bandwidth is considered as a function of the number of data points ($n$) only. Two different data sets with the same number of data points are shown in Fig. 1, one a scaled version of the other. Clearly, the same bandwidth, depending only on the number of data points, does not work for both cases. It is desirable that the inter-point distances of the data play a role in the selection of the bandwidth; therefore, the bandwidth should be a function not only of $n$ but also of the inter-point distances of the data. The Euclidean Minimal Spanning Tree (EMST) (Shamos and Hoey, 1975, Preparata and Shamos, 1985, March et al., 2010) is entirely determined by the Euclidean distances between the sample points, and it has a close relationship with the distribution of the samples. So, the length of the EMST has been considered as the bandwidth for kernel density estimation. In Chaudhuri et al. (1996), the bandwidth has been defined using the EMST of the given samples, but the theory about the asymptotic properties requires the kernel to be uniform. For proving the results, they constructed two sequences of numbers $t_n, \gamma_n$ such that for every $\epsilon > 0$ there exists $M_0 > 0$ with $P(t_n \le h_n \le \gamma_n) \ge 1 - \epsilon\ \forall n \ge M_0$; $t_n, \gamma_n \to 0$, $nt_n^2 \to \infty$, $n\gamma_n^2 \to \infty$ and $t_n/\gamma_n \to 1$ as $n \to \infty$. Such a framework has not been extended to prove the asymptotic properties for a general kernel. Additionally, two assumptions were considered, which are given below. Suppose $N_\alpha(\bar{x}) = \{\bar{y} : |x_i - y_i| \le \alpha\ \forall i\}$, $A_n$ is a sequence of sets such that $P(h_n \in A_n) \ge 1 - \epsilon\ \forall n$, and $\nu_\xi^n(A_n) = P(h_n \in A_n \mid \bar{x}_1 = \xi)$.

  • Assumption 1. There exists $M_1 > 0$ such that $P(\bar{x}_1 \in N_\alpha(\bar{x}) \mid h_n = \alpha) \le \alpha^2 M_1\ \forall n \ge M_2 > 0$ and $\forall \alpha$.

  • Assumption 2. There exists $M_3 > 0$ such that $\nu_\xi^n(A_n) \ge 1 - \epsilon\ \forall n > M_3$ and $\forall \xi$, and for every such sequence $A_n$.

In this article, neither of these assumptions is required; assumption 2, in particular, is a strong one. Moreover, the two sequences $(t_n, \gamma_n)$ mentioned above are not needed either. The approach taken in the proof differs substantially from that of the previous article. Using this different approach, a theoretical analysis of the asymptotic properties of the EMST based density estimator for a general kernel, under the mild assumption of compact support, is provided, and it is proved that the resulting estimator is asymptotically unbiased for the original density at every continuity point. The performance of the EMST bandwidth is found to be comparable to that of traditional cross validation and plug-in techniques, while its computational time is much smaller.

The rest of the paper is organized as follows. Section 2 contains the theoretical analysis of the asymptotic properties of the EMST based bandwidth and the resulting density estimator. In Section 3, the practical benefits of this estimator are demonstrated on both synthetic and real-life data sets by comparing it with some existing methods. Finally, Section 4 contains the discussion and conclusions.

Section snippets

Asymptotic analysis of EMST based bandwidth selector

This section presents the theoretical results for the EMST based density estimator. First, bandwidth selection using the EMST of the given samples is described.

Definition 1 Euclidean Minimal Spanning Tree (EMST)

Let $V = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n\}$ be the set of given observations. Let $G = (V, E)$ be the fully connected, undirected graph defined on $V$, where $E = \{e_{ij}\}\ \forall i, j$ is the set of all edges $e_{ij} = (\bar{x}_i, \bar{x}_j)$. A weight $w_{ij}$, the Euclidean distance between $\bar{x}_i$ and $\bar{x}_j$, is assigned to each edge $e_{ij}$. Then the EMST is a subgraph $G'$ of $G$ with the minimum total weight connecting all the vertices.
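A minimal sketch of computing the EMST with SciPy is given below. The exact function mapping the EMST length to the bandwidth is specified in the paper's Section 2 and is not reproduced in this snippet; the rule used here (mean edge weight, i.e., total length divided by $n - 1$) is purely illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def emst_length(X):
    """Total edge weight of the Euclidean Minimal Spanning Tree built on
    the fully connected graph whose edge weights are the pairwise
    Euclidean distances between the rows of X."""
    W = squareform(pdist(X))                 # dense n x n distance matrix
    T = minimum_spanning_tree(W)             # sparse matrix of the n-1 tree edges
    return T.sum()

# Hypothetical bandwidth rule for illustration only: the paper's actual
# mapping from EMST length to h is given in its Section 2.
rng = np.random.default_rng(2)
X = rng.standard_normal((500, 2))
h = emst_length(X) / (len(X) - 1)            # mean EMST edge weight
```

Because every edge weight scales linearly when the data are rescaled, any bandwidth built from the EMST length inherits the scale of the data, which is exactly the property motivated by Fig. 1: two samples of the same size but different scales receive different bandwidths.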

Experimental results

In this section, the performance of the EMST based bandwidth selector is studied. The experimental study has been conducted with the help of artificial data sets (from Gaussian distributions) as well as publicly available real-life data sets. Ten different shapes of bivariate Gaussian densities, obtained by varying the modes, have been considered. For each such case, data samples of different sizes, n = 50, 100, 250, 500, 1000, 2500, and 5000, have been generated. The optimal bandwidth h (Wand and Jones, 1993)

Discussion and conclusions

A Euclidean minimal spanning tree based bandwidth has been considered for multivariate kernel density estimation. The key idea is that the inter-point distances of the data should affect the selection of the bandwidth. Unlike the traditional approaches, which are based on explicitly optimizing MSE, MISE or AMISE, here the bandwidth is constructed using the EMST of the given samples. The absence of an error criterion to optimize and the dependence on the minimal spanning tree contribute to low

Acknowledgments

A part of this work was done at the Center for Soft Computing Research (CSCR), ISI, Kolkata, Project No. IR/S3/ENC-01/2002. The authors would like to thank Prof. Probal Chaudhuri, Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, for his valuable comments. The authors would also like to thank Dr. Tarn Duong, University of Paris 6, Paris, France, for sharing the ‘R’ codes related to bandwidth selection.

References (52)

  • A.W. Bowman, An alternative method of cross-validation for the smoothing of density estimates, Biometrika (1984).
  • L. Breiman et al., Variable kernel estimates of multivariate densities, Technometrics (1977).
  • G. Brown et al., Conditional likelihood maximisation: a unifying framework for information theoretic feature selection, J. Mach. Learn. Res. (2012).
  • T. Brox et al., Nonparametric density estimation with adaptive, anisotropic kernels for human motion tracking.
  • R.H. Byrd et al., Parallel quasi-Newton methods for unconstrained optimization, Math. Program. (1988).
  • T. Cacoullos, Estimation of a multivariate density, Ann. Inst. Statist. Math. (1966).
  • J. Chacón et al., Multivariate plug-in bandwidth selection with unconstrained pilot bandwidth matrices, TEST (2010).
  • T. Cheng et al., Semiparametric localized bandwidth selection in kernel density estimation.... (2014).
  • R.O. Duda et al., Pattern Classification (1999).
  • T. Duong, ks: Kernel smoothing, R package version 1.9.2. URL:... (2014).
  • P.E. Gill et al., Practical Optimization (1981).
  • M. Girolami et al., Probability density estimation from optimally condensed data samples, IEEE Trans. Pattern Anal. Mach. Intell. (2003).
  • P. Hall, On global properties of variable bandwidth density estimators, Ann. Statist. (1992).
  • P. Hall et al., Smoothed cross-validation, Probab. Theory Related Fields (1992).
  • N.B. Heidenreich et al., Bandwidth selection for kernel density estimation: a review of fully automatic selectors, Adv. Stat. Anal. (2013).
  • A. Hinneburg et al., Denclue 2.0: Fast clustering based on kernel density estimation.
