Abstract
Machine learning research related to the derivatives of the kernel density estimator has received limited attention compared to the density estimator itself. This is despite the general consensus that most of the important features of a data distribution, such as modes, curvature, or even cluster structure, are characterized by its derivatives. In this paper we present a computationally efficient algorithm to calculate kernel density estimates and their derivatives for linearly separable kernels, with significant savings especially for high-dimensional data and higher-order derivatives. It significantly reduces the number of operations (multiplications and derivative evaluations) needed to calculate the estimates, while keeping the results exact (i.e., no approximations are involved). The main idea is that the calculation of multivariate separable kernels and their derivatives, such as the gradient vector and the Hessian matrix, involves a significant number of redundant operations that can be eliminated using the chain rule. A tree-based algorithm that calculates the exact kernel density estimate and its derivatives in the most efficient fashion is presented, with particular focus on optimizing kernel evaluations for individual data pairs. In contrast, most approaches in the literature resort to function approximations or downsampling. The overall computational savings of the presented method could be further increased by incorporating such approximations, which aim to reduce the number of data pairs considered. The theoretical computational complexities of the tree-based method and of the direct method, which performs all multiplications, are compared. The experiments consider the calculation of separable kernels and their derivatives, as well as a measure of how close a point is to the principal curve of a density, which employs first and second derivatives. These results indicate a considerable improvement in computational complexity, and hence running time, over the direct approach.
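To illustrate the redundancy referred to above, consider a generic separable kernel with a common univariate profile \(k\) (an illustrative identity only, not the specific factorization or tree construction of the paper). For \(K(\mathbf {x})=\prod _{l=1}^{d}k(x_{l})\), the product and chain rules give
\[ \frac {\partial K}{\partial x_{l}} = k'(x_{l})\prod _{m\neq l}k(x_{m}), \qquad \frac {\partial ^{2} K}{\partial x_{l}\,\partial x_{m}} = \begin {cases} k''(x_{l})\prod _{r\neq l}k(x_{r}), & l=m,\\ k'(x_{l})\,k'(x_{m})\prod _{r\neq l,m}k(x_{r}), & l\neq m, \end {cases} \]
so every entry of the gradient and Hessian reuses the same \(d\) univariate values \(k(x_{l})\), \(k'(x_{l})\) and \(k''(x_{l})\); evaluating each entry's product from scratch repeats most of these multiplications, and this is the kind of redundancy the tree-based algorithm eliminates.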





Acknowledgments
This work is supported by NSF under grants IIS-1149570, IIS-1118061, and NIH grant 1R21EY022387.
Additional information
M. Shaker and J. N. Myhre contributed equally to this work.
Appendices
Appendix A:
Proof of Fact 1
Subject to the constraint \(\sum _{l} \left | v_{lm} \right | \leq p\) on the growth of \(T\), the number of nodes in the \(l\)-th layer is equal to \(\binom {l+p}{p}\). The proof is by mathematical induction on \(l=1,\dots ,d\). Let the number of nodes in layer \(l\) be \(n_{l}\).
Base case
When \(l=1\), the first dimension contains all the derivative orders \(j=0,\dots ,p\), which makes \(n_{1}=p+1\). The nodes with index \(j\) at the first layer are each expanded by \(p+1-j\) child nodes in the second layer, and therefore \(n_{2}=(p+1)+p+\dots +2+1=\binom {2+p}{2}\), which confirms the statement for \(l=2\).
Induction step
Let \(l'\) be a given layer number, and suppose the hypothesis holds for \(l=l'\). Using Pascal's rule for binomial coefficients, \(n_{l'+1}\) can be written as
\[ n_{l'+1} = \binom {(l'+1)+p}{p} = \binom {l'+p}{p} + \binom {l'+p}{p-1}, \qquad (19) \]
where the left term is the number of nodes at layer \(l'+1\). Looking back at the tree structure, and at Figure 1 as an example, it can be observed that the tree has a self-repeating pattern. At the new layer \(l'+1\), the nodes are divided into two parts. The upper part is originally expanded from \(v_{11}\), and it exactly repeats the nodes of the previous layer. The reason is that the starting node of this branch is the node containing \(j=0\) at the first layer, which does not change the maximum derivative order constraint. As a result, the branch can be shifted one layer back toward the root, and this recursive nature propagates through the whole tree. In (19), this branch is represented by the term \(\binom {l'+p}{p}\), which is equal to \(n_{l'}\) according to the induction hypothesis.
The lower part of the tree is expanded from all other nodes of the first layer except \(v_{11}\). One can observe that these branches are identical to those of a tree with the same number of layers but with a maximum derivative order of \(p-1\). Again the reasoning is based on the maximum derivative order constraint: since the node value at the first layer is nonzero, one less derivative order remains available for valid expansion. By the induction hypothesis, the number of nodes in layer \(l'+1\) subject to the constraint \(\sum _{l} \left | v_{lm} \right | \leq (p-1)\) is \(\binom {(l'+1)+(p-1)}{p-1}=\binom {l'+p}{p-1}\), which is equal to the second term on the right side of (19). Thus the hypothesis holds for \(l=l'+1\), which completes the induction.
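As a quick sanity check of Fact 1, the count can be verified by brute force, assuming (as in the proof) that the nodes of layer \(l\) correspond to nonnegative integer derivative orders \((j_{1},\dots ,j_{l})\) with \(j_{1}+\dots +j_{l}\leq p\). The following Python snippet is an illustrative check, not code from the paper.

from itertools import product
from math import comb

def count_layer_nodes(l, p):
    # Count multi-indices (j_1, ..., j_l) with nonnegative entries summing to at most p.
    return sum(1 for v in product(range(p + 1), repeat=l) if sum(v) <= p)

for p in range(4):
    for l in range(1, 5):
        assert count_layer_nodes(l, p) == comb(l + p, p)
print("Fact 1 verified for l <= 4 and p <= 3")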
Appendix B:
In the case of multivariate separable Gaussian kernels, the exponent has the Mahalanobis distance form. Therefore, decomposing the inverse of the covariance matrix (the precision matrix) gives the matrix that scales and orients the input data relative to the desired kernel bandwidth.
In order to extract \(\mathbf {S}_{i}\), the eigendecomposition of \(\mathbf {C}_{i}\) is utilized,
\[ \mathbf {C}_{i} = \mathbf {Q}_{i}\boldsymbol {\Lambda }_{i}\mathbf {Q}_{i}^{T}, \qquad (21) \]
where the eigenvectors and eigenvalues of \(\mathbf {C}_{i}\) are the columns and diagonal elements of the orthogonal \(\mathbf {Q}_{i}\) and the diagonal \(\boldsymbol {\Lambda }_{i}\), respectively. From (20) and (21), the scale matrix \(\mathbf {S}_{i}\) is written as follows:
\[ \mathbf {S}_{i} = \boldsymbol {\Lambda }_{i}^{1/2}\mathbf {Q}_{i}^{T}. \qquad (22) \]
Using this formulation, \(\mathbf {S}_{i}\) is the bandwidth matrix associated with \(X_{i}\). Similarly, a Cholesky factorization \(\mathbf {C}_{i} =\mathbf {R}^{T}_{i}\mathbf {R}_{i}\) would yield \(\mathbf {S}_{i}=\mathbf {R}_{i}\) as an option.
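The following is a minimal numerical sketch of the two options above, assuming (consistently with the Cholesky remark) that \(\mathbf {C}_{i}\) denotes the symmetric positive definite matrix being factored, so that \(\mathbf {S}_{i}^{T}\mathbf {S}_{i}=\mathbf {C}_{i}\); the matrix used here is randomly generated for illustration and is not data or code from the paper.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
C = A @ A.T + 3.0 * np.eye(3)   # illustrative symmetric positive definite matrix, standing in for C_i

# Option 1: eigendecomposition C = Q diag(lam) Q^T, so S = diag(sqrt(lam)) Q^T satisfies S^T S = C.
lam, Q = np.linalg.eigh(C)
S_eig = np.diag(np.sqrt(lam)) @ Q.T

# Option 2: Cholesky factorization C = R^T R with upper-triangular R, so S = R.
R = np.linalg.cholesky(C).T     # NumPy returns lower-triangular L with C = L L^T; take R = L^T
S_chol = R

assert np.allclose(S_eig.T @ S_eig, C)
assert np.allclose(S_chol.T @ S_chol, C)

Either factor can then play the role of the bandwidth matrix \(\mathbf {S}_{i}\) described above.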
About this article
Cite this article
Shaker, M., Myhre, J.N. & Erdogmus, D. Computationally Efficient Exact Calculation of Kernel Density Derivatives. J Sign Process Syst 81, 321–332 (2015). https://doi.org/10.1007/s11265-014-0904-1