On bandwidth selection using minimal spanning tree for kernel density estimation
Introduction
Many tasks in pattern recognition and machine learning require knowledge of the underlying densities of the observed data (Menardi and Azzalini, 2014, Brown et al., 2012, Stover and Ulm, 2013, Brox et al., 2007, Ji et al., 2014, Jones and Rehg, 2002, Liu et al., 2007). For example, in Bayes classification, the decision rule involves estimation of the class-conditional probabilities of the training data (Duda et al., 1999, Ramoni and Sebastiani, 2001, Kim and Scott, 2010). In model-based clustering, every cluster corresponds to a ‘mode’ or ‘peak’ in the estimated probability density of a given set of points (Li et al., 2007, Hinneburg and Gabriel, 2007, Tang et al., 2015). Density estimation can be done either parametrically or non-parametrically. In parametric estimation, assumptions are made about the structure of the density, whereas in non-parametric estimation no assumptions are made about the form of the density function. Various methods have been studied for non-parametric density estimation, such as histograms, kernel density estimators, spline estimators, orthogonal series estimators, etc. (Silverman, 1986, Scott, 2009, Golyandina et al., 2012). The kernel method is perhaps the most popular and well-known technique of non-parametric estimation (Parzen, 1962, Cacoullos, 1966).
Throughout this article, we use the following notation. Let X₁, …, Xₙ denote ‘n’ independent and identically distributed d-dimensional random vectors, and let ᵀ represent the transpose. A general vector x has the representation x = (x₁, …, x_d)ᵀ, and E(·) denotes the expectation of a random vector. Also, ∫f will be shorthand for ∫f(x)dx, ∫ will be shorthand for ∫_{ℝ^d}, and ‖x‖ denotes the Euclidean norm (xᵀx)^{1/2}.
A general d-dimensional kernel density estimator f̂(x; H), for a random sample X₁, …, Xₙ with common probability density function f, is f̂(x; H) = (1/n) Σᵢ₌₁ⁿ K_H(x − Xᵢ), where K_H is the scaled kernel function, i.e., K_H(u) = |H|^{−1/2} K(H^{−1/2} u), K is a d-variate kernel function, H is a d × d symmetric positive definite bandwidth matrix, and |H| is the determinant of H. Traditionally, K is assumed to be symmetric and to integrate to one, ∫K(u)du = 1. Some commonly used kernel functions are the uniform, triangle, Epanechnikov, Gaussian, etc. The most widely used kernel is the Gaussian with zero mean and unit variance. From the above equation for f̂, it is clear that the kernel density estimate at any test point x is simply the (normalized) sum of the kernel values contributed by all training points Xᵢ. It is well known that bandwidth selection is the most crucial step in obtaining a good estimate. There are mainly two computational challenges associated with KDE: one is the selection of the bandwidth, which is estimated using the training data, and the other is the construction of the density at any test point. Note that bandwidth selection is the only problem considered in this article.
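As a concrete illustration of the estimator above (our own sketch, not code from the paper), the general form with a full bandwidth matrix H and a Gaussian kernel can be written in Python as follows; the function names and the example bandwidth are illustrative:

```python
import numpy as np

def gaussian_kernel(u):
    """d-variate standard Gaussian kernel K evaluated row-wise on u (m x d)."""
    d = u.shape[1]
    return np.exp(-0.5 * np.sum(u * u, axis=1)) / (2.0 * np.pi) ** (d / 2.0)

def kde(x, data, H):
    """General KDE: f_hat(x; H) = (1/n) * sum_i K_H(x - X_i),
    with K_H(u) = |H|^(-1/2) K(H^(-1/2) u).

    Since the Gaussian kernel is radially symmetric, H^(-1/2) u may be
    replaced by L^(-1) u for the Cholesky factor H = L L^T, because only
    u^T H^(-1) u enters the kernel."""
    n, d = data.shape
    L = np.linalg.cholesky(H)
    u = np.linalg.solve(L, (x - data).T).T        # rows are L^(-1) (x - X_i)
    return gaussian_kernel(u).sum() / (n * np.sqrt(np.linalg.det(H)))

# Example: standard bivariate normal; target density at 0 is 1/(2*pi) ~ 0.159
rng = np.random.default_rng(0)
sample = rng.standard_normal((2000, 2))
est = kde(np.zeros(2), sample, 0.25 * np.eye(2))
```

The estimate at the origin is slightly below the target value because the Gaussian kernel smooths the density (the expected estimate is the density of the convolution N(0, I + H)).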
The bandwidth matrix can be taken to be diagonal positive definite, i.e., H = diag(h₁², …, h_d²), to simplify the above equation. Further simplification is obtained from the restriction h₁ = ⋯ = h_d = h, i.e., H = h²I, and this leads to the single-bandwidth kernel density estimator f̂(x; h) = (1/(n hᵈ)) Σᵢ₌₁ⁿ K((x − Xᵢ)/h). A full bandwidth matrix provides more flexibility, but it also introduces more complexity into the estimator, since more parameters need to be tuned (Wand and Jones, 1994). Although the bandwidth can be selected subjectively, there is great demand for automatic selection. Several automatic procedures compute an optimal bandwidth by minimizing the discrepancy between the estimate and the target density according to some error criterion. A few such error criteria are given below (Wand and Jones, 1994).
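The single-bandwidth special case H = h²I reduces to the familiar scalar form. A minimal sketch, again under the Gaussian-kernel assumption and with illustrative names:

```python
import numpy as np

def kde_single_h(x, data, h):
    """Single-bandwidth KDE: f_hat(x; h) = (1/(n h^d)) sum_i K((x - X_i)/h),
    i.e. the special case H = h^2 * I of the full-matrix estimator."""
    n, d = data.shape
    u = (x - data) / h
    k = np.exp(-0.5 * np.sum(u * u, axis=1)) / (2.0 * np.pi) ** (d / 2.0)
    return k.sum() / (n * h ** d)

# Example: univariate standard normal; target f(0) = 0.3989...
rng = np.random.default_rng(0)
sample = rng.standard_normal((2000, 1))
est = kde_single_h(np.zeros(1), sample, 0.3)
```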
- Mean Squared Error (MSE): MSE(x) = E[(f̂(x; H) − f(x))²].
- Mean Integrated Squared Error (MISE): MISE = E ∫ (f̂(x; H) − f(x))² dx.
- Mean Integrated Absolute Error (MIAE): MIAE = E ∫ |f̂(x; H) − f(x)| dx.
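These criteria can be approximated numerically when the target density is known. The sketch below (our own illustration, with a standard normal target and an arbitrary bandwidth of 0.35) estimates MISE by averaging the integrated squared error over repeated samples, using a simple Riemann sum for the integral:

```python
import numpy as np

def kde_grid(grid, data, h):
    """Univariate Gaussian-kernel KDE evaluated at every grid point."""
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u * u).sum(axis=1) / (len(data) * h * np.sqrt(2.0 * np.pi))

def ise(data, h, grid):
    """Integrated squared error against a standard normal target density."""
    target = np.exp(-0.5 * grid ** 2) / np.sqrt(2.0 * np.pi)
    dx = grid[1] - grid[0]
    return np.sum((kde_grid(grid, data, h) - target) ** 2) * dx

# MISE is the expectation of ISE over repeated samples; average to approximate it
rng = np.random.default_rng(1)
grid = np.linspace(-5.0, 5.0, 501)
mise_hat = np.mean([ise(rng.standard_normal(300), 0.35, grid) for _ in range(20)])
```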
Parzen (1962) showed that if the sequence of positive bandwidths hₙ satisfies hₙ → 0 and n·hₙ → ∞ as n → ∞, then the resulting kernel density estimator is asymptotically unbiased and consistent for the target density. A generalization of Parzen’s work to the multivariate case is presented in Cacoullos (1966). Here, the bandwidth is considered a function of the number of data points only. Two different data sets with the same number of data points are shown in Fig. 1; one is a scaled version of the other. Clearly, the same bandwidth, which depends only on the number of data points, does not work for both cases. It is desirable that the inter-point distances of the data play a role in the selection of the bandwidth. Therefore, the bandwidth should be a function not only of n but also of the inter-point distances of the data. The Euclidean Minimal Spanning Tree (EMST) (Shamos and Hoey, 1975, Preparata and Shamos, 1985, March et al., 2010) is entirely determined by the Euclidean distances between sample points, and it has a close relationship with the distribution of the samples. So, the length of the EMST has been considered as the bandwidth for kernel density estimation. In Chaudhuri et al. (1996), the bandwidth has been defined using the EMST of the given samples, but the theory about the asymptotic properties requires the kernel to be uniform. For proving the results, they constructed two sequences of numbers aₙ and bₙ such that, for every n, aₙ ≤ hₙ ≤ bₙ, with bₙ → 0 and n·aₙᵈ → ∞ as n → ∞. This framework has not been extended to prove the asymptotic properties for a general kernel. Additionally, two assumptions were considered, which are given below. Suppose {Sₙ} is a sequence of sets of observations such that Sₙ ⊂ Sₙ₊₁ and |Sₙ| = n.
- Assumption 1. Let such that .
- Assumption 2. Let there exist such that and for every such sequence .
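The scale argument above (Fig. 1) can be illustrated numerically. In this hedged sketch (all values are illustrative and our own), a bandwidth reasonable for a standard normal sample is applied unchanged to a copy of the sample scaled by 10; measured relative to ∫f², the integrated squared error deteriorates markedly:

```python
import numpy as np

def kde_grid(grid, data, h):
    # univariate Gaussian-kernel KDE evaluated on a grid
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u * u).sum(axis=1) / (len(data) * h * np.sqrt(2.0 * np.pi))

def rel_ise(data, sigma, h, grid):
    """ISE against an N(0, sigma^2) target, divided by the scale of the
    problem, integral of f^2 = 1/(2 sigma sqrt(pi))."""
    target = np.exp(-0.5 * (grid / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    dx = grid[1] - grid[0]
    err = np.sum((kde_grid(grid, data, h) - target) ** 2) * dx
    return err * 2.0 * sigma * np.sqrt(np.pi)

rng = np.random.default_rng(2)
data = rng.standard_normal(500)
h = 0.3                               # reasonable for `data`, too small for 10*data
r_orig = rel_ise(data, 1.0, h, np.linspace(-5.0, 5.0, 1001))
r_scaled = rel_ise(10.0 * data, 10.0, h, np.linspace(-50.0, 50.0, 1001))
```

The scaled sample has the same n but ten times the inter-point distances, so the same h severely undersmooths it, which is exactly why a bandwidth depending only on n cannot serve both data sets.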
The rest of the paper is organized as follows. Section 2 contains theoretical analysis of the asymptotic properties of the EMST based bandwidth and the resulting density estimator. In Section 3, practical benefits of this estimator are demonstrated on both synthetic and real-life data sets by comparing it with some of the existing methods. Finally, Section 4 contains the discussion and conclusion.
Asymptotic analysis of EMST based bandwidth selector
This section presents theoretical results for the EMST-based density estimator. First, bandwidth selection using the EMST of the given samples is described. Definition 1 (Euclidean Minimal Spanning Tree (EMST)). Let S = {X₁, …, Xₙ} be the set of given observations. Let G = (S, E) be the fully connected, undirected graph defined on S, where E is the set of all edges (Xᵢ, Xⱼ), i ≠ j. A weight w(Xᵢ, Xⱼ) = ‖Xᵢ − Xⱼ‖, the Euclidean distance between Xᵢ and Xⱼ, is assigned to each edge. Then the EMST is a subgraph T = (S, E_T) of G with the minimum total weight connecting all the vertices.
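To make the definition concrete, the sketch below computes the EMST total length with Prim's algorithm on the complete graph. The final lines show one simple way to turn the tree length into a bandwidth (the average edge length); this normalization is our own illustration, not necessarily the exact rule analysed in the paper.

```python
import numpy as np

def emst_total_length(points):
    """Total edge weight of the Euclidean minimal spanning tree of `points`
    (an n x d array), via Prim's algorithm on the complete graph."""
    n = len(points)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    # dist[j] = distance from vertex j to its nearest tree vertex found so far
    dist = np.linalg.norm(points - points[0], axis=1)
    total = 0.0
    for _ in range(n - 1):
        dist[in_tree] = np.inf            # never re-select tree vertices
        j = int(np.argmin(dist))          # cheapest edge leaving the tree
        total += dist[j]
        in_tree[j] = True
        dist = np.minimum(dist, np.linalg.norm(points - points[j], axis=1))
    return total

# Unit square: the EMST consists of three unit-length edges, total 3.0
square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
length = emst_total_length(square)

# Illustrative bandwidth: average EMST edge length (our own normalization)
rng = np.random.default_rng(3)
sample = rng.standard_normal((200, 2))
h = emst_total_length(sample) / (len(sample) - 1)
```

Prim's algorithm costs O(n²) here; for large n, dual-tree methods (March et al., 2010) compute the EMST far faster.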
Experimental results
In this section, the performance of the EMST-based bandwidth selector is studied. The experimental study has been conducted on artificial data sets (drawn from Gaussian distributions) as well as publicly available real-life data sets. Ten different shapes of bivariate Gaussian densities, obtained by varying the modes, have been considered. For each such case, data samples of different sizes, …, 100, 250, 500, 1000, 2500, and 5000, have been generated. The optimal bandwidth (Wand and Jones, 1993)
Discussion and conclusions
A Euclidean minimal spanning tree based bandwidth has been considered for multivariate kernel density estimation. The key idea is that the inter-point distances of the data should affect the selection of the bandwidth. Unlike traditional approaches, which are based on explicitly optimizing either MSE, MISE, or AMISE, here the bandwidth is constructed from the EMST of the given samples. The absence of an optimizing error criterion and the dependence on the minimal spanning tree contribute to low
Acknowledgments
A part of this work was done at the Center for Soft Computing Research (CSCR), ISI, Kolkata, Project No. IR/S3/ENC-01/2002. The authors would like to thank Prof. Probal Chaudhuri, Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, for his valuable comments. The authors would also like to thank Dr. Tarn Duong, University of Paris 6, Paris, France, for sharing the ‘R’ codes related to bandwidth selection.
References (52)
- A data driven procedure for density estimation with some applications. Pattern Recognit. (1996)
- FRSDE: Fast reduced set density estimator using minimal enclosing ball approximation. Pattern Recognit. (2008)
- New approaches to nonparametric density estimation and selection of smoothing parameters. Comput. Statist. Data Anal. (2012)
- Automatic image annotation by semi-supervised manifold kernel density estimation. Inform. Sci. (2014)
- Robust Bayes classifiers. Artificial Intelligence (2001)
- Hyperparameter estimation and plug-in kernel density estimates for maximum a posteriori land-cover classification with multiband satellite data. Comput. Statist. Data Anal. (2013)
- A Bayesian approach to bandwidth selection for multivariate kernel density estimation. Comput. Statist. Data Anal. (2006)
- Bayesian estimation of adaptive bandwidth matrices in multivariate kernel density estimation. Comput. Statist. Data Anal. (2014)
- Kernel contrasts: a data-based method of choosing smoothing parameters in nonparametric density estimation. J. Nonparametr. Stat. (2004)
- Bache, K., Lichman, M., 2013. UCI Machine Learning Repository. URL:...
- An alternative method of cross-validation for the smoothing of density estimates. Biometrika
- Variable kernel estimates of multivariate densities. Technometrics
- Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J. Mach. Learn. Res.
- Nonparametric density estimation with adaptive, anisotropic kernels for human motion tracking
- Parallel quasi-Newton methods for unconstrained optimization. Math. Program.
- Estimation of a multivariate density. Ann. Inst. Statist. Math.
- Multivariate plug-in bandwidth selection with unconstrained pilot bandwidth matrices. TEST
- Pattern Classification
- Practical Optimization
- Probability density estimation from optimally condensed data samples. IEEE Trans. Pattern Anal. Mach. Intell.
- On global properties of variable bandwidth density estimators. Ann. Statist.
- Smoothed cross-validation. Probab. Theory Related Fields
- Bandwidth selection for kernel density estimation: a review of fully automatic selectors. Adv. Stat. Anal.
- Denclue 2.0: Fast clustering based on kernel density estimation