A fast algorithm for computing least-squares cross-validations for nonparametric conditional kernel density functions

https://doi.org/10.1016/j.csda.2009.08.021

Abstract

Nonparametric conditional density functions are widely used in applied econometric and statistical modelling because they provide rich summaries of the relationships between dependent and independent variables. Although least-squares cross-validation is considered to be the best criterion for bandwidth selection of the kernel estimator of the conditional density, the number of computations required for this procedure grows rapidly, on the order of n^3, as the number of observations n increases. A fast algorithm is proposed to reduce this computational cost, and its accuracy and efficiency are verified via numerical experiments. A practical application is also presented to demonstrate the algorithm’s potential usefulness.

Introduction

Nonparametric conditional density functions have become popular in applied econometric and statistical modelling. They provide summarized information concerning the relationships between independent and dependent variables. Moreover, they are useful for data-driven modelling, such as nonparametric quantile regression, discrete choice modelling, and direct estimation of conditional probability density and distribution functions (see Racine, 2008, for a recent review). The kernel method is a commonly used nonparametric modelling approach, and many studies have examined nonparametric conditional kernel density functions (NP-CKDFs) since the pioneering work of Rosenblatt (1969).

Bandwidth selection is an important consideration for the kernel estimator of the nonparametric conditional density, and it can be accomplished via several techniques. The most popular of these is the plug-in method (e.g. Li and Racine, 2007, Chapter 5), in which the optimal bandwidth is fairly easy to calculate. However, this technique uses a normal distribution to assign a value to the unknown constant in the optimal bandwidth (the bandwidth that minimizes the integrated mean square error). In effect, it treats the form of the underlying densities as known a priori. Racine (2008) pointed out that the plug-in method tends to oversmooth and yields biased results for larger datasets. Bashtannyk and Hyndman (2001) compared several bandwidth selection strategies for NP-CKDFs that were previously introduced by Hyndman et al. (1996). They found the bootstrap method to be the best, although it requires a considerable amount of computation. Hall (1987) studied log-likelihood cross-validation methods and pointed out that log-likelihood cross-validation also tends to oversmooth for larger datasets. Holmes et al. (2007) recently developed dual-tree-based algorithms for the bandwidth selection of NP-CKDFs. Although this technique dramatically reduces the computational cost, it is only applicable to log-likelihood cross-validation criteria.
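To make the plug-in idea concrete, the following is a minimal sketch for the simpler univariate density case, not the conditional-density formula discussed in the text. It uses the familiar normal-reference rule of thumb, h = 1.06 σ n^(-1/5), in which the unknown constant of the optimal bandwidth is filled in by pretending the data are Gaussian. The function name `normal_reference_bandwidth` is a hypothetical choice for illustration.

```python
import numpy as np


def normal_reference_bandwidth(sample):
    """Rule-of-thumb bandwidth for a univariate Gaussian-kernel density estimate.

    Assumption: the data are treated as if drawn from a normal distribution,
    which is exactly the simplification the plug-in method relies on.
    """
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    sigma = sample.std(ddof=1)  # sample standard deviation
    return 1.06 * sigma * n ** (-1.0 / 5.0)


rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=2.0, size=1000)
print(normal_reference_bandwidth(data))
```

Because the constant comes from a normal density, the rule can be a poor guide for skewed or multimodal data, which is one motivation for data-driven criteria such as cross-validation.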

Fan and Yim (2004) and Hall et al. (2004) investigated the least-squares cross-validation (LS-CV) method for selecting the bandwidth of NP-CKDFs. This criterion might be considered the best, in the sense that it minimizes a (weighted) integrated squared error. Fan and Yim (2004) compared the LS-CV method with other conventional techniques and obtained very favourable results concerning its performance. Hall et al. (2004) used the LS-CV method to detect irrelevant/relevant explanatory variables in nonparametric conditional densities. Demand for implementations of this technique continues, and the method has already been included in the statistical software R (R Development Core Team, 2007) as a package named “np” (Hayfield and Racine, 2009).

In spite of its advantages, the practical application of LS-CV to NP-CKDFs has been impeded by its computational complexity. As will be seen in the next section, the evaluation of the objective function for the LS-CV criterion requires O(n^3) operations, which leads to enormous computational costs when the number of observations (denoted by n) increases. In this paper, we develop a fast algorithm for computing the LS-CV function for NP-CKDFs. We employ a Gaussian kernel function, which is the type most widely used in applications. In addition, we restrict our attention to NP-CKDFs with one independent variable. We remark that although there are other established approaches to computational cost reduction (for example, Gray and Moore, 2003; Racine, 2002), ours differs from these by being based on expansion of the kernel function.
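As an illustration of what "expansion of the kernel function" can mean for a Gaussian kernel, the sketch below uses the Hermite expansion that underlies the fast Gauss transform of Greengard et al. (1991), which the concluding remarks name as a building block. It is a simplified one-dimensional, single-centre version written for this illustration, not the algorithm proposed in the paper. The function names, the truncation order `p`, and the test data are assumptions. A Gaussian kernel with bandwidth h corresponds to `delta = 2 * h**2`.

```python
import math

import numpy as np
from numpy.polynomial.hermite import hermval  # physicists' Hermite polynomials


def direct_gauss_sum(targets, sources, weights, delta):
    """Direct O(N*M) evaluation of sum_j q_j * exp(-(t - s_j)^2 / delta)."""
    diff = targets[:, None] - sources[None, :]
    return (weights[None, :] * np.exp(-diff ** 2 / delta)).sum(axis=1)


def hermite_gauss_sum(targets, sources, weights, delta, center, p=20):
    """Same sum via a Hermite expansion about `center`, truncated at p terms.

    Uses exp(-(u - v)^2) = sum_n v^n / n! * H_n(u) * exp(-u^2).
    Accurate only when the sources lie close to `center` relative to sqrt(delta).
    """
    v = (sources - center) / math.sqrt(delta)
    u = (targets - center) / math.sqrt(delta)
    # Moments depend on the sources only: A_n = (1/n!) * sum_j q_j * v_j^n.
    moments = np.array(
        [np.sum(weights * v ** n) / math.factorial(n) for n in range(p)]
    )
    # Evaluation depends on the targets only: O((N + M) * p) work in total.
    return np.exp(-u ** 2) * hermval(u, moments)


rng = np.random.default_rng(1)
sources = rng.uniform(-0.5, 0.5, size=200)
weights = np.ones(200)
targets = np.linspace(-3.0, 3.0, 50)
exact = direct_gauss_sum(targets, sources, weights, 1.0)
approx = hermite_gauss_sum(targets, sources, weights, 1.0, center=0.0)
print(np.max(np.abs(exact - approx)))
```

The point of the expansion is that sources and targets are decoupled: the moments are computed once, and each target then costs only p operations instead of one per source.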

The outline of the paper is as follows. In Section 2, we clarify the computational difficulties in the LS-CV criterion and present the proposed algorithm. Section 3 describes numerical experiments designed to verify the accuracy and efficiency of the algorithm compared to the conventional method. In Section 4, the potential practicality of the algorithm is demonstrated via an application to the analysis of travel time variation in highway traffic, using an actual large dataset.

Section snippets

A fast algorithm for computing least-squares cross-validations for nonparametric conditional kernel density functions

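Let $(X_i, Y_i)$, $i = 1, \ldots, n$, be an iid sample of an independent and dependent variable pair. In the case of one independent variable, the LS-CV criterion for NP-CKDFs is intended to minimize the following objective function (Li and Racine, 2007, pp. 157–160):

$$CV_f(h_x, h_y) = \frac{1}{n}\sum_{i=1}^{n}\frac{\hat{G}_{-i}(X_i, Y_i)}{\{\hat{\mu}_{-i}(X_i)\}^2} - \frac{2}{n}\sum_{i=1}^{n}\frac{\hat{g}_{-i}(X_i, Y_i)}{\hat{\mu}_{-i}(X_i)},$$

where

$$\begin{cases}
\hat{\mu}_{-i}(X_i) = \dfrac{1}{n-1}\displaystyle\sum_{j=1,\, j\neq i}^{n} K_{h_x}(X_i, X_j),\\[2ex]
\hat{g}_{-i}(X_i, Y_i) = \dfrac{1}{n-1}\displaystyle\sum_{j=1,\, j\neq i}^{n} w_{h_y}(Y_i, Y_j)\, K_{h_x}(X_i, X_j),\\[2ex]
\hat{G}_{-i}(X_i, Y_i) = \dfrac{1}{(n-1)^2}\displaystyle\sum_{j=1,\, j\neq i}^{n}\ \sum_{l=1,\, l\neq i}^{n} K_{h_x}(X_i, X_j)\, K_{h_x}(X_i, X_l)\, w^{w}(Y_j, Y_l),\\[2ex]
w^{w}(Y_j, Y_l) = \displaystyle\int w_{h_y}(y, Y_j)\, w_{h_y}
\end{cases}$$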
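A direct evaluation of this objective function can be sketched as follows, to show where the cubic cost comes from. This is an illustrative baseline, not the fast algorithm of the paper. It assumes Gaussian kernels for both variables, and it assumes that the truncated definition of the y-kernel product term is the integral of two Gaussian kernels over y, which has the closed form of a Gaussian with bandwidth sqrt(2) * h_y. The names `gauss` and `lscv_naive` are hypothetical.

```python
import numpy as np


def gauss(u, h):
    """Gaussian kernel with bandwidth h evaluated at difference u."""
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2.0 * np.pi))


def lscv_naive(x, y, hx, hy):
    """Naive LS-CV objective for a conditional density with one regressor."""
    n = len(x)
    kx = gauss(x[:, None] - x[None, :], hx)  # K_hx(X_i, X_j)
    wy = gauss(y[:, None] - y[None, :], hy)  # w_hy(Y_i, Y_j)
    # Assumed closed form of the integrated product of two Gaussian y-kernels.
    ww = gauss(y[:, None] - y[None, :], np.sqrt(2.0) * hy)
    np.fill_diagonal(kx, 0.0)  # leave-one-out: drop every j == i and l == i term
    mu = kx.sum(axis=1) / (n - 1)
    g = (kx * wy).sum(axis=1) / (n - 1)
    # Double sum over j and l for every i: the matrix product costs O(n^3).
    big_g = ((kx @ ww) * kx).sum(axis=1) / (n - 1) ** 2
    return np.mean(big_g / mu ** 2) - 2.0 * np.mean(g / mu)


rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + 0.5 * rng.normal(size=200)
print(lscv_naive(x, y, hx=0.4, hy=0.4))
```

A bandwidth search calls this function many times, so the O(n^3) term in the double sum dominates the total running time for even moderately large n.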

Numerical experiment I

In this section, we verify the accuracy and efficiency of the proposed algorithm via numerical experiments using artificial datasets. For a given pair, $n$ and $n_x$, an artificial dataset, denoted by dataset$(n, n_x)$, can be created using Algorithm 2, as in Takeuchi et al. (2006). Datasets were generated for the following cases: $n = 625 \times 2^{i-1}$ and $n_x = 500 \times 2^{j-1}$, where $i = 1, \ldots, 12$ and $j = 1, 2, 3$. Numerical experiments were carried out serially using one core of the processor of a personal computer (2.2 GHz,
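The grid of experimental sizes described above can be enumerated with a few lines of code. The sketch also shows, as an idealized calculation that ignores constants and lower-order terms, how an O(n^3) cost scales from one sample size to the next. The variable and function names are hypothetical.

```python
# Sample sizes n = 625 * 2^(i-1) for i = 1..12 and n_x = 500 * 2^(j-1) for j = 1..3.
sizes_n = [625 * 2 ** (i - 1) for i in range(1, 13)]
sizes_nx = [500 * 2 ** (j - 1) for j in range(1, 4)]


def naive_cost_ratio(n_small, n_large):
    """Ratio of idealized O(n^3) operation counts between two sample sizes."""
    return (n_large / n_small) ** 3


print(sizes_n[0], sizes_n[-1])                    # 625 1280000
print(sizes_nx)                                   # [500, 1000, 2000]
print(naive_cost_ratio(sizes_n[0], sizes_n[1]))   # 8.0
```

Each doubling of n multiplies the idealized cubic cost by eight, so the largest case is far out of reach for direct evaluation.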

Numerical experiment II

In this section, we verify the practicality of the proposed method by using it to analyze the relationship between time of day and travel time for actual traffic data. In traffic engineering applications, interest has grown in how to represent day-to-day and same-day variations in the travel time of traffic on urban roads (e.g. Hollander and Liu, 2008, for recent examples). In particular, it is important to be able to estimate the distribution of travel times (TT) on an urban road conditional

Concluding remarks

This paper presents a fast algorithm for computing LS-CV for NP-CKDFs based on the fast Gauss transform (FGT) and computational decomposition. Its accuracy and computational efficiency have been verified by numerical experiments. The proposed algorithm is about 3×10^11 times faster than the conventional algorithm, even for small datasets. An application involving travel time underscores its potential practicality as well as the advantages of its appropriate use for large-scale datasets. Tune-up algorithms of this sort

Acknowledgement

We are grateful to Dr. Mogens Fosgerau for providing the travel time dataset.

References (20)

  • D. Bashtannyk et al. Bandwidth selection for kernel conditional density estimation. Computational Statistics and Data Analysis (2001)
  • Y. Hollander et al. Estimation of the distribution of travel times by repeated simulation. Transportation Research Part C: Emerging Technologies (2008)
  • J. Racine. Parallel distributed kernel estimation. Computational Statistics and Data Analysis (2002)
  • A. Elgammal et al. Efficient kernel density estimation using the fast Gauss transform with applications to color modeling and tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (2003)
  • J. Fan et al. A crossvalidation method for estimating conditional densities. Biometrika (2004)
  • Fosgerau, M., Hjorth, K., Brems, C., Fukuda, D., 2008. Travel time variability: Definition and valuation. Tech. Rep.,...
  • Gray, A., Moore, A., 2003. Very fast multivariate kernel density estimation via computational geometry. In: Joint Stat....
  • L. Greengard et al. The fast Gauss transform. SIAM Journal on Scientific Computing (1991)
  • P. Hall. On Kullback–Leibler loss and density estimation. Annals of Statistics (1987)
  • P. Hall et al. Cross-validation and the estimation of conditional probability densities. Journal of the American Statistical Association (2004)
There are more references available in the full text version of this article.
