On discrete Epanechnikov kernel functions

https://doi.org/10.1016/j.csda.2017.07.003

Abstract

Least-squares cross-validation is commonly used for selection of smoothing parameters in the discrete data setting; however, in many applied situations, it tends to select relatively small bandwidths. This tendency to undersmooth is due in part to the geometric weighting scheme that many discrete kernels possess. This problem may be avoided by using alternative kernel functions. Specifically, discrete versions (both unordered and ordered) of the popular Epanechnikov kernel do not have rapidly decaying weights. The analytic properties of these kernels are contrasted with those of commonly used discrete kernel functions, and their relative performance is compared using both simulated and real data. The simulation and empirical results show that these kernel functions generally perform well and in some cases deliver substantial gains in terms of mean squared error.

Introduction

An intuitive approach to estimate a univariate discrete probability (mass) function is to use the sample frequency of occurrence as the estimator of a cell probability (i.e., the frequency approach). However, when the number of cells is close to or even greater than the sample size (the data are sparse), the frequency approach does not work well due to many zero counts (Simonoff, 1996). In this case, applied researchers often resort to a smoothing approach, which introduces bias but can dramatically lower mean squared error (MSE). In this paper, we focus on the kernel smoothing approach, where the underlying density (probability mass function) $p(x)$ is estimated by $\hat{p}(x)=\frac{1}{n}\sum_{i=1}^{n} l(X_i, x, \lambda)$, with a kernel function $l(\cdot)$ appropriate for smoothing discrete data and smoothing parameter $\lambda$. Existing discrete kernel functions date back to Aitchison and Aitken (1976), Habbema et al. (1978), Titterington (1980), Wang and van Ryzin (1981), and Aitken (1983). More recently, Li and Racine (2003) propose kernel functions for smoothing both unordered and ordered discrete data.
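To fix ideas, the sketch below implements this estimator for an unordered discrete variable using the Aitchison and Aitken (1976) kernel, one of the kernels referenced above; the category count, sample size, and cell probabilities are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def aitchison_aitken(Xi, x, lam, c):
    """Unordered discrete kernel of Aitchison and Aitken (1976):
    weight 1 - lam when Xi == x, lam / (c - 1) otherwise,
    where c is the number of categories and 0 <= lam <= (c - 1) / c."""
    return np.where(Xi == x, 1.0 - lam, lam / (c - 1))

def kde_discrete(sample, support, lam, kernel, **kw):
    """p_hat(x) = (1/n) * sum_i l(X_i, x, lam), evaluated at each support cell."""
    sample = np.asarray(sample)
    return np.array([kernel(sample, x, lam, **kw).mean() for x in support])

# Toy example: a sparse unordered variable with c = 5 categories.
rng = np.random.default_rng(0)
X = rng.choice(5, size=30, p=[0.4, 0.3, 0.15, 0.1, 0.05])
p_hat = kde_discrete(X, support=range(5), lam=0.1, kernel=aitchison_aitken, c=5)
print(p_hat, p_hat.sum())  # estimated cell probabilities; they sum to one
```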

The kernel function’s ability to smooth data hinges on the bandwidth (or smoothing parameter). How this bandwidth is selected is of the utmost importance in applied work, and least-squares cross-validation (LSCV) has proven a popular approach when discrete data are present, given the lack of simple rule-of-thumb or plug-in bandwidths (see Chu et al., 2015 for some recent work in this direction). However, in many applied situations, LSCV tends to select a bandwidth that is small relative to the theoretical optimum (undersmoothing), particularly when discrete data are sparse (e.g., see Asparoukhov and Krzanowski (2001) or Coppejans (2003)). One explanation for this problem is that many ordered discrete kernel functions possess geometrically decaying weighting schemes, leading to a rapid decline in the weights used to smooth the data (Rajagopalan and Lall, 1995). Adding to this line of reasoning, Chu et al. (2015) show that for an ordered discrete kernel function with a geometric weighting structure, the optimal bandwidth, in terms of the mean summed squared error (MSSE) criterion, is a real root of a polynomial whose order is determined by the number of cells. The main insight from this relationship is that the optimal bandwidth is inversely related to the order of the polynomial, potentially compounding the small bandwidth problem.
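For concreteness, the following is a minimal sketch of the LSCV criterion for a discrete density estimate, again using the Aitchison and Aitken (1976) kernel; the grid, sample, and objective coding are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def aitchison_aitken(Xi, x, lam, c):
    # Unordered kernel of Aitchison and Aitken (1976).
    return np.where(Xi == x, 1.0 - lam, lam / (c - 1))

def lscv(sample, support, lam, c):
    """Least-squares cross-validation objective for a discrete kernel
    density estimate: sum_x p_hat(x)^2 - (2/n) * sum_i p_hat_{-i}(X_i)."""
    sample = np.asarray(sample)
    n = len(sample)
    p_hat = np.array([aitchison_aitken(sample, x, lam, c).mean() for x in support])
    term1 = np.sum(p_hat ** 2)
    loo = 0.0
    for i in range(n):
        rest = np.delete(sample, i)                  # leave-one-out sample
        loo += aitchison_aitken(rest, sample[i], lam, c).mean()
    return term1 - 2.0 * loo / n

rng = np.random.default_rng(1)
c = 8
X = rng.choice(c, size=25)                           # sparse: 25 draws over 8 cells
grid = np.linspace(0.0, (c - 1) / c, 50)             # admissible range of lambda
lam_cv = grid[np.argmin([lscv(X, range(c), l, c) for l in grid])]
print("LSCV-selected lambda:", lam_cv)
```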

These issues also occur in kernel regression estimation. For example, Henderson and Kumbhakar (2006) note that in their longitudinal/panel application, capturing unobserved heterogeneity through an unordered discrete variable (with respect to the cross-sectional dimension) results in a relatively small bandwidth. In this case, the regression estimator essentially uses only T (time) observations for each cross-sectional unit. It is likely that this problem is pervasive in papers using nonparametric methods in the presence of longitudinal data where the cross-sectional specific heterogeneity is treated as an unordered discrete variable.
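The following toy sketch illustrates this point: smoothing only over the cross-sectional identifier with an unordered (Aitchison–Aitken) kernel, a bandwidth near zero makes the local-constant fit at a given unit essentially the mean of that unit's own T observations. The panel dimensions and data-generating process here are hypothetical.

```python
import numpy as np

def aitchison_aitken(Zi, z, lam, c):
    # Unordered kernel applied to the cross-sectional identifier.
    return np.where(Zi == z, 1.0 - lam, lam / (c - 1))

# Toy panel: N units observed for T periods, with unit-specific means.
rng = np.random.default_rng(2)
N, T = 10, 5
unit = np.repeat(np.arange(N), T)
y = rng.normal(loc=unit, scale=0.3)   # outcome shifted by unit heterogeneity

def local_constant(z, lam):
    """Local-constant (Nadaraya-Watson) fit smoothing only over the unit id."""
    w = aitchison_aitken(unit, z, lam, N)
    return np.sum(w * y) / np.sum(w)

# With lambda near zero, the fit at unit 3 is (almost) the mean of its own T obs.
print(local_constant(3, lam=1e-4), y[unit == 3].mean())
```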

Although different methods have been proposed to resolve the issue of undersmoothing, most are modifications of existing error criteria designed mainly for continuous variables (for example, see Härdle et al., 1988, Chiu, 1990, Hart and Yi, 1998, Hurvich et al., 1998, Hall and Robinson, 2009) or involve sample splitting (Li et al., 2016). Unlike existing studies, we attempt to use alternative discrete kernel functions in conjunction with the LSCV criterion.

Rajagopalan and Lall (1995) develop an ordered discrete version of the Epanechnikov (1969) kernel function which does not possess a geometric weighting scheme and thus provides sufficient smoothing in the presence of sparse data. Unfortunately, applied researchers who adopt kernel smoothing methods are largely unaware of this kernel function (an exception is Guerra et al., 1997), which motivates us to promote its wider application. Specifically, we detail Rajagopalan and Lall’s (1995) ordered discrete Epanechnikov kernel function and propose an unordered discrete Epanechnikov kernel function.

In a similar vein, Kokonendji et al. (2007) develop a so-called triangular probability mass function and use it as an ordered discrete kernel function. Their triangular kernel function does not impose a geometric weighting structure, but it is less relevant for our discussion here because it is designed for count data with excess zeros and involves two parameters, which complicates bandwidth selection.

For both the unordered and ordered discrete Epanechnikov kernel functions, we derive the MSSE of the kernel density estimator (probability mass function). Further, we demonstrate that a sufficient condition for asymptotic normality of both the kernel density and regression estimators is satisfied by this new kernel, namely by establishing a second-order approximation of the discrete kernels proposed here, similar to that used by Li and Racine (2003).

The results here are unique relative to the continuous data setting, where it is well known that kernel choice is ancillary to bandwidth choice; the discrete kernel appears to play a more important role. Given that the asymptotic bias and variance of kernel density and regression estimators are independent of the discrete kernel used to smooth the data, this is a finite sample issue. The topic merits study because, in the continuous-only case, it is relatively easy to assess the efficiency loss from using a particular kernel relative to the optimal one. Our goal here is to study the impact of the choice of discrete kernel in a rigorous fashion through a variety of analytic, simulated and real data settings.

Here we examine the discrete Epanechnikov kernel functions versus Aitchison and Aitken’s (1976), Wang and van Ryzin’s (1981), and Li and Racine’s (2003) kernel functions in simulations and empirical examples. For this set of kernel functions, the simulation results show the discrete Epanechnikov kernel functions generally perform well. We find that a researcher is generally no worse off and sometimes better off using a discrete Epanechnikov kernel in density estimation. However, the researcher is significantly worse off when using the ordered discrete Epanechnikov kernel in the (continuous) conditional density setting when there is an irrelevant ordered variable present. In both cross-sectional and longitudinal data regression, the researcher appears to be no worse off using the unordered discrete Epanechnikov kernel and is sometimes strictly better off. However, when the data are sparse, the Wang and van Ryzin or Li and Racine kernel performs better than the ordered discrete Epanechnikov kernel. This result is surprising as the ordered discrete Epanechnikov kernel was designed for sparse data in the density setting. It appears that these properties do not translate to the regression setting. In the case of ordered discrete kernels with longitudinal data, we find no substantial differences across kernel functions in our simulations. Our empirical examples largely mimic the simulation results except in the longitudinal data setting where we find both substantial gains and losses for the discrete Epanechnikov kernels.

The remainder of this paper is organized as follows: Section 2 presents the ordered discrete Epanechnikov kernel, develops the unordered discrete Epanechnikov kernel, compares the analytic properties of these kernel functions with those commonly used in the literature and presents the asymptotic properties of density and regression estimators using these kernels. Section 3 shows the finite sample performance via simulations. Section 4 provides several empirical illustrations and Section 5 concludes.

Section snippets

Discrete Epanechnikov kernel functions

For the case of a continuous random variable X, Epanechnikov (1969) shows that the MSE-optimal second-order kernel function is
$$k(\psi_X)=\begin{cases} a\psi_X^2+b & \text{if } |\psi_X|\le 1,\\ 0 & \text{if } |\psi_X|>1,\end{cases}$$
where $-a=b=0.75$, $\psi_X=(X-x)/h$ and $h$ is the bandwidth. Rajagopalan and Lall (1995) extend this set-up to an ordered, discrete random variable X. The discrete version of the optimal second-order kernel $l(\cdot)$ is required to satisfy two conditions: (A) $\sum_{X=x-h}^{x+h} l(\psi_X)=1$ and (B) $\sum_{X=x-h}^{x+h} l(\psi_X)\,\psi_X=0$. Condition (A) is the discrete counterpart of requiring the kernel to integrate to one.
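As an illustration of conditions (A) and (B), the sketch below builds a normalized quadratic (Epanechnikov-style) weighting over the offsets $-h,\dots,h$ and contrasts it with a geometrically decaying Wang and van Ryzin (1981) weighting. The normalization used here is an illustrative assumption and need not match the exact constants that emerge from Rajagopalan and Lall's (1995) MSE derivation.

```python
import numpy as np

def epanechnikov_discrete(h):
    """Quadratic weights over offsets j = -h, ..., h, proportional to
    1 - (j / (h + 1))**2 and normalized to sum to one (condition (A));
    condition (B) holds by symmetry.  Illustrative only: the exact constants
    of Rajagopalan and Lall's (1995) kernel follow from their derivation."""
    j = np.arange(-h, h + 1)
    w = 1.0 - (j / (h + 1)) ** 2
    return j, w / w.sum()

def wang_van_ryzin(j, lam):
    """Wang and van Ryzin (1981) weights at offset j: geometric decay in |j|."""
    j = np.asarray(j)
    return np.where(j == 0, 1.0 - lam, 0.5 * (1.0 - lam) * lam ** np.abs(j))

h = 4
offsets, w_epa = epanechnikov_discrete(h)
w_wvr = wang_van_ryzin(offsets, lam=0.3)
print("offset  Epanechnikov  Wang-van Ryzin")
for j, we, wg in zip(offsets, w_epa, w_wvr):
    print(f"{j:6d}  {we:12.4f}  {wg:14.4f}")
# Verify conditions (A) and (B) for the quadratic weighting.
print("sum =", w_epa.sum(), " first moment =", np.sum(w_epa * offsets))
```

The printed weights make the contrast concrete: the quadratic weights decline slowly toward the edges of the window, whereas the geometric weights collapse rapidly away from the evaluation cell.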

Simulations

This section provides comprehensive simulations. Our goal is to determine where gains from the discrete Epanechnikov kernel functions might be found and provide some guidance for how to suitably choose kernel functions in practice across different settings. We will provide details on the design of the simulations, present the results and then summarize which kernels we suggest to use and when to use them.

Empirical illustrations

Here we consider different types of real data to evaluate the empirical performance of the kernel functions. We consider a univariate unordered discrete density (Lindsey, 1995, Greene, 2011), a univariate ordered discrete density with sparse data (Simonoff, 1996), a discrete conditional density (Li and Racine, 2004), and panel data regression (Cameron and Trivedi, 2005, Henderson and Kumbhakar, 2006). We will introduce each of the data sets and then present the results.

Conclusion

In this paper we consider discrete Epanechnikov kernels for use with discrete data. Specifically, we start with Rajagopalan and Lall’s (1995) ordered discrete Epanechnikov kernel function and propose an unordered version. For each of these kernel functions, MSSE for the kernel density estimator is derived and we show that a second order polynomial expansion of these new kernels exists, from which asymptotic normality for the respective kernel density and regression estimators holds.

We compare

Acknowledgments

We would like to thank the editor, Ana Colubi, an anonymous associate editor, two anonymous referees, Subha Chakraborti, Anna Gotlib and Jennifer Stoever. We would also like to thank conference participants at the 6th Asian Meeting of the Econometric Society, the 22nd International Panel Data Conference, the 3rd International Association for Applied Econometrics Conference, the 24th Symposium of the Society for Nonlinear Dynamics and Econometrics, the 25th Annual Meeting of the Midwest

References (43)

  • Aitchison, J., et al. (1976). Multivariate binary discrimination by the kernel method. Biometrika.
  • Baltagi, B.H., et al. (1995). Public capital stock and state productivity growth: further evidence from an error components model. Empir. Econ.
  • Cameron, A.C., et al. (2005). Microeconometrics: Methods and Applications.
  • Chiu, S.-T. (1990). Why bandwidth selectors tend to choose smaller bandwidths, and a remedy. Biometrika.
  • Chu, C.-Y., et al. (2015). Plug-in bandwidth selection for kernel density estimation with discrete data. Econometrics.
  • Dong, J., et al. (1994). The construction and properties of boundary kernels for smoothing sparse multinomials. J. Comput. Graph. Statist.
  • Epanechnikov, V.A. (1969). Nonparametric estimations of a multivariate probability density. Theory Probab. Appl.
  • Greene, W. (2011). Econometric Analysis.
  • Habbema, J.D.F., et al. (1978). Variable Kernel Density Estimation in Discriminant Analysis.
  • Hall, P., et al. (2004). Cross-validation and the estimation of conditional probability densities. J. Amer. Statist. Assoc.
  • Hall, P., et al. (2009). Reducing variability of crossvalidation for smoothing-parameter choice. Biometrika.