Abstract
Machine learning research related to the derivatives of the kernel density estimator has received limited attention compared to the density estimator itself. This is despite the general consensus that most of the important features of a data distribution, such as modes, curvature, or even cluster structure, are characterized by its derivatives. In this paper we present a computationally efficient algorithm to calculate kernel density estimates and their derivatives for linearly separable kernels, with significant savings especially for high-dimensional data and higher-order derivatives. It significantly reduces the number of operations (multiplications and derivative evaluations) needed to calculate the estimates, while keeping the results exact (i.e., no approximations are involved). The main idea is that the calculation of multivariate separable kernels and their derivatives, such as the gradient vector and the Hessian matrix, involves a significant number of redundant operations that can be eliminated using the chain rule. A tree-based algorithm that calculates the exact kernel density estimate and its derivatives in the most efficient fashion is presented, with particular focus on optimizing kernel evaluations for individual data pairs. In contrast, most approaches in the literature resort to function approximations or downsampling. The overall computational savings of the presented method could be further increased by incorporating such approximations, which aim to reduce the number of data pairs considered. The theoretical computational complexities of the tree-based method and of the direct method, which performs all multiplications, are compared. The experiments consider the calculation of separable kernels and their derivatives, as well as a measure of how close a point is to the principal curve of a density, which employs first and second derivatives. These results indicate a considerable improvement in computational complexity, and hence running time, over the direct approach.
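To illustrate the redundancy referred to above, consider a generic separable kernel with a common univariate profile \(k\) (an illustrative identity only, not the specific factorization or tree construction of the paper). For \(K(\mathbf {x})=\prod _{l=1}^{d}k(x_{l})\), the product and chain rules give
\[ \frac {\partial K}{\partial x_{l}} = k'(x_{l})\prod _{m\neq l}k(x_{m}), \qquad \frac {\partial ^{2} K}{\partial x_{l}\,\partial x_{m}} = \begin {cases} k''(x_{l})\prod _{r\neq l}k(x_{r}), & l=m,\\ k'(x_{l})\,k'(x_{m})\prod _{r\neq l,m}k(x_{r}), & l\neq m, \end {cases} \]
so every entry of the gradient and Hessian reuses the same \(d\) univariate values \(k(x_{l})\), \(k'(x_{l})\) and \(k''(x_{l})\); evaluating each entry's product from scratch repeats most of these multiplications, and this is the kind of redundancy the tree-based algorithm eliminates.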





Acknowledgments
This work is supported by NSF under grants IIS-1149570, IIS-1118061, and NIH grant 1R21EY022387.
Additional information
M. Shaker and J. N. Myhre contributed equally to this work.
Appendices
Appendix A:
Proof of Fact 1
Subject to the constraint \(\sum _{l} \left | v_{lm} \right | \leq p\) on the growth of \(T\), the number of nodes in the \(l\)-th layer is equal to \(\binom {l+p}{p}\). The proof is by mathematical induction on \(l=1,\dots ,d\). Let the number of nodes in layer \(l\) be \(n_{l}\).
Base case
When \(l=1\), the first dimension contains all the derivative orders \(j=0,\dots ,p\), which makes \(n_{1}=p+1\). The nodes with index \(j\) at the first layer are each expanded by \(p+1-j\) child nodes in the second layer, and therefore \(n_{2}=(p+1)+p+\dots +2+1=\binom {2+p}{2}\), which confirms the statement for \(l=2\).
Induction step
Let \(l'\) be a given layer number, and suppose the hypothesis holds for \(l=l'\). Using Pascal's rule for binomial coefficients, \(n_{l'+1}\) can be written as
\[ n_{l'+1} = \binom {(l'+1)+p}{p} = \binom {l'+p}{p} + \binom {l'+p}{p-1}, \qquad (19) \]
where the left term is the number of nodes at layer \(l'+1\). Looking back at the tree structure, and at Figure 1 as an example, it can be observed that the tree has a self-repeating pattern. At the new layer \(l'+1\), the nodes are divided into two parts. The upper part is originally expanded from \(v_{11}\), and it exactly repeats the nodes of the previous layer. The reason is that the starting node of this branch is the node containing \(j=0\) at the first layer, which does not change the maximum derivative order constraint. As a result, the branch can be shifted one layer back toward the root, and this recursive nature propagates through the whole tree. In (19), this branch is represented by the term \(\binom {l'+p}{p}\), which is equal to \(n_{l'}\) according to the induction hypothesis.
The lower part of the tree is expanded from all other nodes of the first layer except \(v_{11}\). One can observe that these branches are identical to those of a tree with the same number of layers but with a maximum derivative order of \(p-1\). Again the reasoning is based on the maximum derivative order constraint: since the node value at the first layer is nonzero, one less derivative order remains available for valid expansion. By the induction hypothesis, the number of nodes in layer \(l'+1\) subject to the constraint \(\sum _{l} \left | v_{lm} \right | \leq (p-1)\) is \(\binom {(l'+1)+(p-1)}{p-1}=\binom {l'+p}{p-1}\), which is equal to the second term on the right side of (19). Thus the hypothesis holds for \(l=l'+1\), which completes the induction.
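As a quick sanity check of Fact 1, the count can be verified by brute force, assuming (as in the proof) that the nodes of layer \(l\) correspond to nonnegative integer derivative orders \((j_{1},\dots ,j_{l})\) with \(j_{1}+\dots +j_{l}\leq p\). The following Python snippet is an illustrative check, not code from the paper.

from itertools import product
from math import comb

def count_layer_nodes(l, p):
    # Count multi-indices (j_1, ..., j_l) with nonnegative entries summing to at most p.
    return sum(1 for v in product(range(p + 1), repeat=l) if sum(v) <= p)

for p in range(4):
    for l in range(1, 5):
        assert count_layer_nodes(l, p) == comb(l + p, p)
print("Fact 1 verified for l <= 4 and p <= 3")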
Appendix B:
In the case of multivariate separable Gaussian kernels, the exponent has the Mahalanobis distance form. Therefore, decomposing the inverse of the covariance matrix (the precision matrix) gives the matrix that scales and orients the input data relative to the desired kernel bandwidth.
In order to extract \(\mathbf {S}_{i}\), the eigendecomposition of \(\mathbf {C}_{i}\) is utilized,
\[ \mathbf {C}_{i} = \mathbf {Q}_{i}\boldsymbol {\Lambda }_{i}\mathbf {Q}_{i}^{T}, \qquad (21) \]
where the eigenvectors and eigenvalues of \(\mathbf {C}_{i}\) are the columns and diagonal elements of the orthogonal \(\mathbf {Q}_{i}\) and the diagonal \(\boldsymbol {\Lambda }_{i}\), respectively. From (20) and (21), the scale matrix \(\mathbf {S}_{i}\) is written as follows:
\[ \mathbf {S}_{i} = \boldsymbol {\Lambda }_{i}^{1/2}\mathbf {Q}_{i}^{T}. \qquad (22) \]
Using this formulation, \(\mathbf {S}_{i}\) is the bandwidth matrix associated with \(X_{i}\). Similarly, a Cholesky factorization \(\mathbf {C}_{i} =\mathbf {R}^{T}_{i}\mathbf {R}_{i}\) would yield \(\mathbf {S}_{i}=\mathbf {R}_{i}\) as an option.
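The following is a minimal numerical sketch of the two options above, assuming (consistently with the Cholesky remark) that \(\mathbf {C}_{i}\) denotes the symmetric positive definite matrix being factored, so that \(\mathbf {S}_{i}^{T}\mathbf {S}_{i}=\mathbf {C}_{i}\); the matrix used here is randomly generated for illustration and is not data or code from the paper.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
C = A @ A.T + 3.0 * np.eye(3)   # illustrative symmetric positive definite matrix, standing in for C_i

# Option 1: eigendecomposition C = Q diag(lam) Q^T, so S = diag(sqrt(lam)) Q^T satisfies S^T S = C.
lam, Q = np.linalg.eigh(C)
S_eig = np.diag(np.sqrt(lam)) @ Q.T

# Option 2: Cholesky factorization C = R^T R with upper-triangular R, so S = R.
R = np.linalg.cholesky(C).T     # NumPy returns lower-triangular L with C = L L^T; take R = L^T
S_chol = R

assert np.allclose(S_eig.T @ S_eig, C)
assert np.allclose(S_chol.T @ S_chol, C)

Either factor can then play the role of the bandwidth matrix \(\mathbf {S}_{i}\) described above.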
About this article
Cite this article
Shaker, M., Myhre, J.N. & Erdogmus, D. Computationally Efficient Exact Calculation of Kernel Density Derivatives. J Sign Process Syst 81, 321–332 (2015). https://doi.org/10.1007/s11265-014-0904-1