Variational Bayesian functional PCA

https://doi.org/10.1016/j.csda.2008.09.015

Abstract

A Bayesian approach to analyzing the modes of variation in a set of curves is suggested. It is based on a generative model, thus allowing for noisy and sparse observations of curves. A Demmler–Reinsch(-type) basis is used to enforce smoothness of the latent (‘eigen’)functions. Inference, including estimation, error assessment and model choice, particularly the choice of the number of eigenfunctions and of their degree of smoothness, is derived from a variational approximation of the posterior distribution. The proposed analysis is illustrated with simulated and real data.

Introduction

Functional Principal Component Analysis (FPCA) is Principal Component Analysis (PCA) applied to functions instead of vectors, where functions differ from vectors by continuity or smoothness. Hence some of the objectives of FPCA when applied to a bundle of curves are similar to those of classical PCA: to describe the set of curves by a characteristic average curve and patterns of variability, to find a low-dimensional representation of individual curves, to relate their shape to covariates, or to compare different sets of curves (see for example Nerini and Ghattas (2007)). If curves are only partially observed, their low-dimensional representation amounts to a reconstruction of the whole curve, a problem that is specific to functions.

In traditional multivariate analysis, principal components (PCs) are defined as linear transformations of the observed random vector, obtained by forming inner products with eigenvectors of the covariance matrix. This approach is generalized in most frequentist FPCA, substituting an inner product function space, often $L^2$, for the Euclidean space, covariance operators for covariance matrices, and eigenfunctions for eigenvectors. Depending on the choice of the function space, and particularly of the inner product, different versions of FPCA result (Besse, 1988, Silverman, 1996, Ocaña et al., 1999, Ocaña et al., 2007, Manté et al., 2007).

To date there is a vast literature on functional data analysis (FDA) and in particular on FPCA. The monographs by Ramsay and Silverman (2002, 2005) established FDA as a topic in its own right within statistics and demonstrated its potential in many applications. The recent monograph by Ferraty and Vieu (2006) provides a complementary view of FDA, emphasizing a nonparametric approach. The paper by Müller (2005) on FPCA and functional regression offers a review as well as an introduction to the ongoing discussion. Also, special issues of several journals indicate growing interest in the field of FDA (Davidian et al., 2004, Valderrama, 2007, Manteiga and Vieu, 2007). Applied studies using FPCA have proven the versatility and usefulness of the methods. In their book, Ramsay and Silverman (2002) present a collection of case studies from the fields of growth analysis, criminology, meteorology and others. Biomedical data have been analyzed in the papers by Grambsch et al. (1995), Zhao et al. (2004), Behseta et al. (2005), Müller and Wang (2005) and Hyndman and Ullah (2007). Economic applications of FPCA were considered, for example, by Aguilera et al. (1999a) and Kneip and Utikal (2001). A geophysical application was presented by Manteiga and Vieu (2007).

As a technique for dimension reduction, FPCA has been applied as part of more comprehensive analyses. In particular, principal component regression was extended to functional regression (with a scalar, for instance binary, response and functional predictors) to cope with multicollinearity among regressors (Escabias et al., 2004, Escabias et al., 2005, Aguilera et al., 2006). Functional regression with PCA applied to functional predictors was also studied by Cardot et al. (2003) and compared (based on simulations and asymptotics) to more direct (penalized likelihood) estimation of the parameter functions. The technique was again used by Müller and Stadtmüller (2005) and Leng and Müller (2006). The approach was, however, criticized on the same grounds as finite-dimensional PCA in regression (Cardot et al., 2003, Saporta et al., 2007, Escabias et al., 2007): the dimension reduction is obtained independently of the response variable, so the resulting functional predictors are not necessarily optimal for explaining the response. In an approach related to functional principal component regression, techniques for forecasting were elaborated in a series of papers (Aguilera et al., 1997, Aguilera et al., 1999b, Aguilera et al., 1999c, Valderrama et al., 2002, Aguilera et al., 2008). FPCA can further be used for purposes for which multivariate PCA is not suitable. For example, FPCA has been successfully employed to model and estimate features of stochastic processes (Bouzas et al., 2006, Bouzas et al., 2007, Fernández-Alcalá et al., 2007). Chiou and Müller (2007) use FPCA scores for a residual analysis in functional regression.

FPCA, like PCA, is often used as an exploratory technique, and functional estimates are displayed without error bounds. From a frequentist point of view, statistical inference and error assessments are based on asymptotic theory or bootstrapping methods. Asymptotic theory in FPCA mainly addresses consistency of estimates of the covariance (operator), eigenfunctions and eigenvalues, and the asymptotic distribution of eigenvalues and PCs. Asymptotic results are obtained under varying assumptions, in particular with respect to the choice of the function space and the sampling scheme (for example, Dauxois et al. (1982) for fully observed curves in $L^2$, or Müller and Wang (2005) and Hall et al. (2006) for discretized curves and curves possibly observed at irregularly spaced sparse points). Related to the specification of the function space is the choice of the smoothing method applied to fit observed functions or eigenfunctions, such as the use of roughness penalties (Pezzulli and Silverman, 1993, Silverman, 1995, Yao and Lee, 2006), kernel methods (Boente and Fraiman, 2000) or locally weighted least squares (Hall et al., 2006). Less attention has been paid to the estimation of the mean function (average curve) as a nuisance parameter in frequentist FPCA and to the reconstruction of the observed functions based on estimated eigenfunctions. Only recently, when FPCA was extended to curves observed at irregular designs, has this come into the focus of asymptotic theory as well (Müller and Wang, 2005). Alternatively, bootstrap methods have been applied to obtain confidence intervals for reconstructed curves (James et al., 2000).

The Bayesian approach provides a principled alternative to asymptotic inference. Considering observations and parameters as dual, additional information is drawn not from infinitely many hypothetical observations but from a prior distribution on the parameter space. Inference can then be based on the posterior distribution of the parameters conditional on the data. The Bayesian approach to FPCA elaborated in this paper yields an easy-to-handle approximate posterior distribution, which allows for simple and fast error assessments for all quantities of interest. This holds for discretized curves on a common regular grid as well as for curves observed at different designs with irregularly spaced sparse points.

Bayesian contributions to FDA and especially to FPCA are rare (Behseta et al., 2005). There are mainly two reasons why Bayesians tend to be reluctant about FDA and FPCA. One concern is about the function spaces occurring as sample space or as parameter space in genuine FDA: there is hardly any choice of densities on function spaces to be used as likelihood or prior. A probability distribution on a function space is usually conceptualized as a stochastic process, implying finite-dimensional margins for a discretization of functions. In practice, working with a discretization may be sufficient; theoretically, it is less ambitious. Behseta et al. (2005) present an approach referring to a stochastic process. In this paper a finite-dimensional discretization with a stochastic process in the background will be employed.

A second reason lies in the tradition of perceiving (and teaching) PCA for random vectors according to the ‘representative’ approach, where the PCs are (linear) transformations of the observed random vector. The stochastically independent PCs thus ‘represent’ the original variables. This is the prevailing approach, and it is the one generalized in most frequentist FPCA. To Bayesians this view implies a rather inconvenient parameterization of the problem. Much more appealing is the (geometrically) ‘generative’ approach, based on the optimality property of PCA that the (standardized) eigenvectors of the covariance matrix form (orthonormal) basis vectors of low-dimensional best approximating subspaces for the observed data vectors. Related is the (stochastically) generative approach, based on the optimality property of PCs as random variables that best predict the observed random variables componentwise.

From a frequentist perspective, the generative approach was elaborated by a French group of statisticians in Toulouse (Besse, 1994, Besse et al., 1997, Cardot, 2000). Although it had been observed early on (Reinsel and Velu, 1998, ch. 7) that PCs can be obtained from a generative model by performing a Factor Analysis with isotropic noise, and Bayesian Factor Analysis had been around for a long time (see Rowe (2003) and the references therein), it was the revival of this approach by Tipping and Bishop (1999) under the name of ‘probabilistic PCA’ that enhanced Bayesian PCA in recent years (Bishop, 1999a, Bishop, 2006, Minka, 2001, Zhang et al., 2004). Šmídl and Quinn (2007) provide a review while presenting their own version.
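For orientation, the probabilistic PCA model of Tipping and Bishop (1999) can be stated compactly: an observed vector $x\in\mathbb{R}^D$ is generated from a $K$-dimensional latent vector $z$ as

\[
x = Wz + \mu + \varepsilon,\qquad z\sim N(0,I_K),\qquad \varepsilon\sim N(0,\sigma^2 I_D),
\]

so that marginally $x\sim N(\mu,\,WW^\top+\sigma^2 I_D)$, and the maximum likelihood estimate of $W$ spans the principal subspace.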

Given the generative model, key inferential issues in Bayesian PCA are:

  • A factor analytic model implies an indeterminacy of factors up to a nonsingular transformation, or at least up to sign and rotation. In Factor Analysis, a rotation of estimated factors is therefore often carried out to improve interpretability of the factors. In classical PCA orthonormal eigenvectors span the best approximating subspaces, hence it is natural to impose orthogonality constraints in the generative model (Minka, 2001, Šmídl and Quinn, 2007).

  • Factor Analysis, and thus probabilistic PCA, is based on a generative latent variable model. Factors are not initially assumed to be transformations of observed variables. Generative models can easily be generalized to allow for non-Gaussian distributional assumptions for the latent variables, as exemplified by the machine learning community in the development of Independent Component Analysis and Independent Factor Analysis. See Roberts and Everson (2001) or Hyvärinen et al. (2001).

  • Bayesian inference in Independent Component Analysis or Independent Factor Analysis is often based on variational methods or ‘ensemble learning’ to derive an approximate posterior distribution in closed form (Attias, 1998, Choudrey and Roberts, 2001). Variational inference was also elaborated for probabilistic PCA, without orthogonality constraints by Bishop (1999b) and with orthogonality constraints by Šmídl and Quinn (2007); the bound these methods optimize is recalled after this list.

  • The choice of the number of PCs or factors is an important subproblem in PCA or Factor Analysis. The variational approach includes options to decide on the number of factors (Bishop, 1999b; Bishop, 2006, ch. 10.1.4). A more sophisticated approach with a prior on the number of factors requires inference based on sampling methods (Lopes and West, 2004, Zhang et al., 2004).
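For reference, the variational (‘ensemble learning’) methods mentioned in this list all maximize a lower bound on the log evidence: for data $X$, unknowns $\theta$ and an approximating density $q$,

\[
\log p(X) \;=\; \log\!\int p(X,\theta)\,d\theta \;\ge\; \int q(\theta)\,\log\frac{p(X,\theta)}{q(\theta)}\,d\theta \;=:\; L_q,
\]

with equality exactly when $q$ equals the posterior; the gap is the Kullback–Leibler divergence of $q$ from $p(\theta\mid X)$. Maximizing $L_q$ over a tractable family yields a closed-form approximate posterior, and the attained value of $L_q$ doubles as a model choice criterion, which is how the number of eigenfunctions is selected below.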

Putting all these results together, it seems promising to attempt a Bayesian FPCA for discretized functions, possibly observed with noise, as a probabilistic PCA incorporating a smoothness constraint in order to find smooth eigenfunctions spanning a low-dimensional subspace for the original functions. In frequentist FDA, smoothness is usually enforced by introducing roughness penalties (Ramsay and Silverman, 2005, Besse, 1994). A roughness penalty is equivalent to a smoothness prior and is an obvious option in a Bayesian analysis. Results obtained from an implementation of this idea will be discussed. An alternative to a smoothness prior is the choice of a basis of smooth functions, like splines, in which the eigenfunctions are represented. With a common basis, observations of functions at different, possibly irregularly spaced points can also be handled. This advantage motivated the generative model of James et al. (2000), which is closest to the approach pursued here, though their inference is frequentist.

Most often, because of their flexibility, free-knot B-splines are chosen as common basis functions. The knots then act as smoothing parameters, the estimation of which in a Bayesian approach requires sampling techniques. Hence the speed and the ease of a closed-form implementation of the variational algorithm are lost. In order to reduce the number of smoothing parameters and to retain the applicability of a variational algorithm, including its device for a quick assessment of models with various numbers of eigenfunctions, a different spline basis is suggested. All functions are represented as interpolation splines on a fine grid of points (i.e. many, but known, knots), and the Demmler–Reinsch basis is chosen to build up the interpolation splines. This is most parsimonious, as the Demmler–Reinsch basis itself results from a PCA for interpolation splines (van der Linde, 2003). The rougher such a basis function is, the more important it is in reconstructing an interpolation spline. Therefore this basis is very effective in filtering noise, and the degree of smoothness of the eigenfunctions can be controlled by just one smoothness parameter, the number of Demmler–Reinsch basis functions. Enforcing smoothness technically conflicts with orthogonality constraints imposed by a prior. Therefore the selection of a rotation is not incorporated into the derivation of the (approximate) posterior distribution, but is based on that distribution.
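To make the construction concrete, the following is a minimal sketch of a discrete Demmler–Reinsch-type basis, assuming a regular grid and a second-difference roughness penalty. The paper's construction via interpolation splines (van der Linde, 2003) differs in detail, so this is an illustration of the idea, not the paper's algorithm.

```python
import numpy as np

def demmler_reinsch_basis(N, r):
    """Discrete Demmler-Reinsch-type basis on a regular grid of N points.

    The eigenvectors of the second-difference roughness penalty matrix
    are ordered from smoothest (eigenvalue ~ 0: constant, linear) to
    roughest; keeping the first r of them gives a smooth r-dimensional
    basis, so r acts as the single smoothness parameter.
    """
    # Second-difference operator D of shape (N-2, N).
    D = np.zeros((N - 2, N))
    for i in range(N - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    Omega = D.T @ D                      # roughness penalty matrix
    _, eigvecs = np.linalg.eigh(Omega)   # eigenvalues in ascending order
    return eigvecs[:, :r]                # N x r matrix of basis vectors

# Example: a 9-dimensional smooth basis on a grid of 100 points.
B = demmler_reinsch_basis(100, 9)
```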

In summary, the proposed Bayesian FPCA essentially combines a generative model like that considered by James et al. (2000) with a variational algorithm like the one outlined by Bishop (1999b). To achieve this, a special basis of smooth functions is selected. The corresponding variational algorithm is partially newly derived, and finally a rotation of the resulting eigenfunctions is suggested. The approach is applied to simulated data to illustrate its performance, and to real data for comparison with previous analyses in the literature, emphasizing the estimation of the (smooth) mean function and the error assessment for reconstructed curves.

The paper is organized as follows. In Section 2 a motivating example with simulated curves is described, illustrating the goals and problems in FPCA. Next, the model for FPCA is specified and the variational algorithm for inference is outlined in Section 3. In Section 4 the motivating example is revisited, and another example with a real data set is investigated. A concluding discussion is given in Section 5.

Section snippets

Description of simulated data and preliminary analyses

Consider the mixture of two Gaussian densities, $g(t)=0.5\,p_{N(0,1)}(t)+0.5\,p_{N(6,4)}(t)$. To obtain a single curve, a sample of 30 values $t_j$ was generated from this mixture distribution, and a kernel estimate
\[
\hat g(t)=\frac{1}{30h}\sum_{j=1}^{30}k\!\left(\frac{t_j-t}{h}\right)
\]
was formed with a standard Gaussian kernel $k$ and bandwidth $h=1.06\,\hat\sigma_t\,30^{-0.2}$, where $\hat\sigma_t$ denotes the empirical standard deviation in the sample. Although this bandwidth is asymptotically optimal, the kernel estimate tends to undersmooth the first peak of $g$ at zero.

100 curves were …
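The data-generating recipe above can be sketched in a few lines of Python; the grid, the random seed and the names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel_estimate_curve(t_grid, n=30):
    """One simulated curve: a Gaussian-kernel density estimate of a sample
    of size n from the mixture 0.5*N(0,1) + 0.5*N(6,4)."""
    comp = rng.random(n) < 0.5
    t = np.where(comp, rng.normal(0.0, 1.0, n), rng.normal(6.0, 2.0, n))
    h = 1.06 * t.std() * n ** (-0.2)              # rule-of-thumb bandwidth
    u = (t[None, :] - t_grid[:, None]) / h        # (grid, sample) differences
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # standard Gaussian kernel
    return k.sum(axis=1) / (n * h)

t_grid = np.linspace(-4.0, 12.0, 100)             # illustrative grid
curves = np.stack([kernel_estimate_curve(t_grid) for _ in range(100)])
```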

Model(s) and variational algorithm

Consider real-valued functions $f_m,\ m=1,\dots,M$, on $\mathbb{R}$, and assume $f_m$ to be a thin plate spline of order 2 (see Wahba (1990, ch. 2.4)). To discretize, a grid $d$ of $N$ points is chosen, yielding vectors of function values $f_m^d\in\mathbb{R}^N$. Each $f_m$ is approximated by the interpolation spline $h_I(f_m^d)$, the smoothest function (in the sense of minimizing $\int (h''(t))^2\,dt$) among all functions $h$ attaining the values $f_m^d$ on $d$. Thus the discretization is equivalent to restricting ourselves to a function space $H_I(d)$, say, of …
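Schematically, the generative model in this kind of FPCA (cf. James et al. (2000)) can be written as below. The notation is a simplified stand-in for the paper's parameterization via basis coefficient matrices, so treat it as a sketch rather than the paper's exact model: curve $m$, observed at points $t_{mj}$ with noise, is a smooth mean plus a linear combination of $K$ smooth eigenfunctions,

\[
y_{mj} \;=\; \mu(t_{mj}) + \sum_{k=1}^{K}\alpha_{mk}\,\phi_k(t_{mj}) + \varepsilon_{mj},
\qquad \alpha_{mk}\sim N(0,\lambda_k),\quad \varepsilon_{mj}\sim N(0,\sigma^2),
\]

where $\mu$ and the $\phi_k$ are expanded in the common smooth (Demmler–Reinsch) basis, so that only basis coefficients need to be inferred.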

Complete curves with noise

$M=100$ curves are now given as kernel estimates $\hat g$ (Eq. (3)) plus Gaussian noise as described in Section 2. The dimensionality $r_Q$ of $Q_d$ to represent the mean function and $r_R$ of $R_d$ to represent the $K$ eigenfunctions was determined by maximizing the lower bound $L_q$ for each $K\in\{1,\dots,6\}$ with respect to $r_Q$ and $r_R$, both within a range of 5–15. The maximization of the lower bound was based on 15 iterations (with one iteration comprising an update of all parameters), which correspond to a threshold of 0.0005 …
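The model search just described amounts to a small grid search over $(K, r_Q, r_R)$ scored by the lower bound. Below is a schematic in Python, where fit_vb_fpca is a hypothetical stand-in for the paper's variational updates; its name and signature are illustrative assumptions.

```python
import itertools

def select_model(curves, fit_vb_fpca,
                 K_range=range(1, 7), r_range=range(5, 16)):
    """Grid search over (K, r_Q, r_R), keeping the configuration with the
    largest variational lower bound L_q. `fit_vb_fpca(curves, K, rQ, rR,
    n_iter)` is assumed to run n_iter full update sweeps and return the
    attained L_q; it is a placeholder, not the paper's code."""
    best_Lq, best_cfg = float("-inf"), None
    for K, rQ, rR in itertools.product(K_range, r_range, r_range):
        Lq = fit_vb_fpca(curves, K, rQ, rR, n_iter=15)
        if Lq > best_Lq:
            best_Lq, best_cfg = Lq, (K, rQ, rR)
    return best_cfg, best_Lq
```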

Smooth mean functions

A difference between the (frequentist) representative and the (Bayesian) generative approach lies in the treatment of the mean function. If the focus is on estimating the covariance, the mean function is regarded as a nuisance parameter and estimated separately. In the generative approach the mean function and the eigenfunctions are estimated simultaneously as smooth functions (or rather alternately, in the variational algorithm), and a crucial issue for the performance of the model is …

Acknowledgements

The author would like to thank two anonymous referees for valuable hints and comments which helped to considerably improve an earlier draft of this paper.

References (72)

  • W.G. Manteiga et al., Statistics for functional data, Comput. Statist. Data Anal. (2007)
  • D. Nerini et al., Classifying densities using functional regression trees. Applications in oceanology, Comput. Statist. Data Anal. (2007)
  • F.A. Ocaña et al., Functional principal component analysis by choice of norm, J. Multivariate Anal. (1999)
  • V. Šmídl et al., On Bayesian principal component analysis, Comput. Statist. Data Anal. (2007)
  • M.E. Tipping et al., Variational inference for Student-t models: Robust Bayesian interpolation and generalized component analysis, Neurocomputing (2005)
  • A.M. Aguilera et al., Stochastic modelling for evolution of stock prices by means of functional principal component analysis, Appl. Stoch. Models Bus. Ind. (1999)
  • A.M. Aguilera et al., Forecasting time series by functional PCA. Discussion of several weighted approaches, Comput. Statist. (1999)
  • A.M. Aguilera et al., Forecasting with unequally spaced data by a functional principal component approach, Test (1999)
  • A.M. Aguilera et al., An approximated principal component prediction model for continuous-time stochastic processes, Appl. Stoch. Models Data Anal. (1997)
  • H. Attias, Independent factor analysis, Neural Comput. (1998)
  • S. Behseta et al., Hierarchical models for assessing variability among functions, Biometrika (2005)
  • Besse, P.C., 1994. Models for multivariate data analysis. In: Dutter R., Grossmann W. (Eds.). COMPSTAT, Proceedings in...
  • P.C. Besse, Spline functions and optimal metric in linear principal component analysis
  • C.M. Bishop, Bayesian PCA
  • C.M. Bishop, Variational principal components
  • C.M. Bishop, Pattern Recognition and Machine Learning (2006)
  • P.R. Bouzas et al., Functional approach to the random mean of a compound Cox process, Comput. Statist. (2007)
  • H. Cardot, Nonparametric estimation of smoothed principal components analysis of sampled noisy functions, Nonparametric Statist. (2000)
  • H. Cardot et al., Spline estimators for the functional linear model, Statist. Sinica (2003)
  • Cemgil, A.T., Févotte, C., Godsill, S.J., 2007. Variational and stochastic inference for Bayesian source separation....
  • Choudrey, R.A., Roberts, S.J., 2001. Flexible Bayesian independent component analysis for blind source separation. In:...
  • M. Davidian et al., Emerging issues in longitudinal and functional data analysis. Introduction, Statist. Sinica (2004)
  • M. Escabias et al., Modeling environmental data by functional principal component logistic regression, Environmetrics (2005)
  • M. Escabias et al., Principal component estimation of functional logistic regression: Discussion of two different approaches, Nonparametric Statist. (2004)
  • R.M. Fernández-Alcalá et al., Functional estimation incorporating prior correlation information, Comput. Statist. (2007)
  • F. Ferraty et al., Nonparametric Functional Data Analysis (2006)