Variational Bayesian functional PCA
Introduction
Functional Principal Component Analysis (FPCA) is Principal Component Analysis (PCA) applied to functions instead of vectors, although functions differ from vectors by their continuity or smoothness. Some of the objectives of FPCA applied to a bundle of curves are therefore similar to those of classical PCA: to describe the set of curves by a characteristic average curve and patterns of variability, to find a low-dimensional representation of individual curves, to relate their shape to covariates or to compare different sets of curves (see for example Nerini and Ghattas (2007)). If curves are only partially observed, their low-dimensional representation amounts to a reconstruction of the whole curve, a problem that is specific to functions.
In traditional multivariate analysis principal components (PCs) are defined as linear transformations of the observed random vector, obtained by forming inner products with eigenvectors of the covariance matrix. This approach is generalized in most of frequentist FPCA, substituting an inner product function space, often the space L2 of square integrable functions, for the Euclidean space, covariance operators for covariance matrices, and eigenfunctions for eigenvectors. Depending on the choice of the function space, and in particular of the inner product, different versions of FPCA result (Besse, 1988, Silverman, 1996, Ocaña et al., 1999, Ocaña et al., 2007, Manté et al., 2007).
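For curves discretized on a common regular grid, this generalization can be sketched numerically: the covariance operator becomes a covariance matrix on the grid, eigenfunctions become its eigenvectors, and PC scores are inner products of centered curves with those eigenvectors. The following minimal sketch is ours, not taken from the paper, and ignores quadrature weights (harmless on a regular grid):

```python
import numpy as np

def fpca(X, n_components=3):
    """Discretized FPCA. X has shape (n_curves, n_gridpoints)."""
    mu = X.mean(axis=0)                       # mean function on the grid
    C = np.cov(X, rowvar=False)               # discretized covariance operator
    eigvals, eigvecs = np.linalg.eigh(C)      # eigh returns ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    phi = eigvecs[:, idx]                     # discretized eigenfunctions
    scores = (X - mu) @ phi                   # PC scores as inner products
    return mu, eigvals[idx], phi, scores

# Reconstruction of curves from their low-dimensional representation:
#   X_hat = mu + scores @ phi.T
```

Partial observation is what makes the functional case special: reconstruction of a whole curve then rests on the scores estimated from the observed part.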
To date there is a vast literature on functional data analysis (FDA) and in particular FPCA. The monographs by Ramsay and Silverman (2002, 2005) established FDA as a topic of its own in statistics and demonstrated its potential in many applications. The recent monograph by Ferraty and Vieu (2006) provides a complementary view on FDA emphasizing a nonparametric approach. The paper by Müller (2005) on FPCA and functional regression offers a review as well as an introduction to the ongoing discussion. Also, special issues of several journals indicate that there is growing interest in the field of FDA (Davidian et al., 2004, Valderrama, 2007, Manteiga and Vieu, 2007). Applied studies using FPCA have proven the versatility and usefulness of the methods. In their book Ramsay and Silverman (2002) present a collection of case studies from the fields of growth analysis, criminology, meteorology and others. Biomedical data have been analyzed in the papers by Grambsch et al. (1995), Zhao et al. (2004), Behseta et al. (2005), Müller and Wang (2005) and Hyndman and Ullah (2007). Economic applications of FPCA were considered, for example, by Aguilera et al. (1999a) and Kneip and Utikal (2001). A geophysical application was presented by Manteiga and Vieu (2007).
As a technique for dimension reduction FPCA has been applied as part of more comprehensive analyses. In particular, Principal Component Regression was extended to functional regression (with a scalar, for instance binary, response and functional predictors) to cope with multicollinearity among regressors (Escabias et al., 2004, Escabias et al., 2005, Aguilera et al., 2006). Functional regression with PCA applied to functional predictors was also studied by Cardot et al. (2003) and compared (based on simulations and asymptotics) to more direct (penalized likelihood) estimation of the parameter functions. The technique was used again by Müller and Stadtmüller (2005) and Leng and Müller (2006). The approach was, however, criticized on the same grounds as finite-dimensional PCA in regression (Cardot et al., 2003, Saporta et al., 2007, Escabias et al., 2007): the dimension reduction is obtained independently of the response variable, so that the resulting functional predictors are not necessarily optimal for explaining the response. In an approach related to functional principal component regression, techniques for forecasting were elaborated in a series of papers (Aguilera et al., 1997, Aguilera et al., 1999b, Aguilera et al., 1999c, Valderrama et al., 2002, Aguilera et al., 2008). FPCA can further be used for purposes for which multivariate PCA is not suitable. For example, FPCA has been successfully employed to model and estimate features of stochastic processes (Bouzas et al., 2006, Bouzas et al., 2007, Fernández-Alcalá et al., 2007). Chiou and Müller (2007) use FPCA scores for a residual analysis in functional regression.
FPCA, like PCA, is often used as an exploratory technique, and functional estimates are displayed without error bounds. From a frequentist point of view statistical inference and error assessments are based on asymptotic theory or bootstrapping methods. Asymptotic theory in FPCA mainly addresses consistency of estimates for the covariance (operator), eigenfunctions and eigenvalues and the asymptotic distribution of eigenvalues and PCs. Asymptotic results are obtained under varying assumptions, in particular with respect to the choice of the function space and the sampling scheme (for example, Dauxois et al. (1982) for fully observed curves in L2, or Müller and Wang (2005) and Hall et al. (2006) for discretized curves and curves possibly observed at irregularly spaced sparse points). Related to the specification of the function space is the choice of the smoothing method applied to fit observed functions or eigenfunctions, like the use of roughness penalties (Pezzulli and Silverman, 1993, Silverman, 1995, Yao and Lee, 2006), kernel methods (Boente and Fraiman, 2000) or locally weighted least squares (Hall et al., 2006). Less attention has been paid to the estimation of the mean function (average curve) as a nuisance parameter in frequentist FPCA and to the reconstruction of the observed functions based on estimated eigenfunctions. Only recently, when FPCA was extended to curves observed at irregular designs, has this come into the focus of asymptotic theory as well (Müller and Wang, 2005). Alternatively, bootstrap methods have been applied to obtain confidence intervals for reconstructed curves (James et al., 2000).
The Bayesian approach provides a principled alternative to asymptotic inference. Considering observations and parameters as dual, additional information is not drawn from infinitely many hypothetical observations but from a prior distribution on the parameter space. Inference can then be based on the posterior distribution of the parameters conditional on the data. The Bayesian approach to FPCA elaborated in this paper yields an easy-to-handle approximate posterior distribution which allows for simple and fast error assessments for all quantities of interest. This holds for discretized curves on a common regular grid as well as for curves observed at different designs with irregularly spaced sparse points.
Bayesian contributions to FDA and especially FPCA are rare (Behseta et al., 2005). There are mainly two reasons why Bayesians tend to be reluctant about FDA and FPCA. One concern is about the function spaces occurring as sample space or as parameter space in genuine FDA. There is hardly any choice for densities on function spaces to be used as likelihood or prior. A probability distribution on function space is usually conceptualized as a stochastic process implying finite-dimensional margins for a discretization of functions. In practice working with a discretization may be sufficient; theoretically it is less ambitious. Behseta et al. (2005) present an approach referring to a stochastic process. In this paper a finite-dimensional discretization with a stochastic process in the background will be employed.
A second reason lies in the tradition of perceiving (and teaching) PCA for random vectors according to the ‘representative’ approach, where the PCs are (linear) transformations of the observed random vector. The stochastically independent PCs thus ‘represent’ the original variables. This is the prevailing approach which is generalized in most of frequentist FPCA. To Bayesians this view implies a rather inconvenient parameterization of the problem. Much more appealing is the (geometrically) ‘generative’ approach based on the optimality property of PCA that the (standardized) eigenvectors of the covariance matrix form (orthonormal) basis vectors of low-dimensional best approximating subspaces for the observed data vectors. Related is the (stochastically) generative approach based on the optimality property of PCs as random variables to best predict the observed random variables componentwise.
From a frequentist perspective the generative approach was elaborated by a French group of statisticians in Toulouse (Besse, 1994, Besse et al., 1997, Cardot, 2000). Although it was observed very early (Reinsel and Velu, 1998, ch. 7) that PCs can be obtained from a generative model performing a Factor Analysis with isotropic noise and Bayesian Factor Analysis had been around for a long time (see Rowe (2003) and the references therein), it was the revival of this approach by Tipping and Bishop (1999) under the name of ‘probabilistic PCA’ that enhanced Bayesian PCA in recent years (Bishop, 1999a, Bishop, 2006, Minka, 2001, Zhang et al., 2004). Šmídl and Quinn (2007) provide a review while presenting their own version.
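Probabilistic PCA in the sense of Tipping and Bishop (1999) posits the generative latent variable model x = Wz + mu + eps, with z ~ N(0, I) and isotropic noise eps ~ N(0, sigma^2 I); its maximum likelihood solution is available in closed form from the eigen-decomposition of the sample covariance. A hedged sketch (variable names are ours):

```python
import numpy as np

def ppca_ml(X, q):
    """Closed-form ML fit of probabilistic PCA. X: (n, d), q: latent dimension."""
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)              # sample covariance (d, d)
    lam, U = np.linalg.eigh(S)               # ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]           # reorder descending
    sigma2 = lam[q:].mean()                  # noise variance: mean of discarded eigenvalues
    # Loadings: leading eigenvectors scaled by excess variance over the noise floor
    W = U[:, :q] * np.sqrt(np.maximum(lam[:q] - sigma2, 0.0))
    return mu, W, sigma2
```

Note that W is determined only up to rotation within the latent space, which is exactly the indeterminacy discussed next.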
Given the generative model, key inferential issues in Bayesian PCA are:
- A factor analytic model implies an indeterminacy of factors up to a nonsingular transformation or at least up to sign and rotation. In Factor Analysis a rotation of estimated factors is therefore often carried out to improve interpretability of the factors. In classical PCA orthonormal eigenvectors span the best approximating subspaces, hence it is natural to impose orthogonality constraints in the generative model (Minka, 2001, Šmídl and Quinn, 2007).
- Factor Analysis, and thus probabilistic PCA, is based on a generative latent variable model. Factors are not initially assumed to be transformations of observed variables. Generative models can easily be generalized to allow for non-Gaussian distributional assumptions for the latent variables, as exemplified by the machine learning community in the development of Independent Component Analysis and Independent Factor Analysis. See Roberts and Everson (2001) or Hyvärinen et al. (2001).
- Bayesian inference in Independent Component Analysis or Independent Factor Analysis is often based on variational methods or ‘ensemble learning’ to derive an approximate posterior distribution in closed form (Attias, 1998, Choudrey and Roberts, 2001). Variational inference was also elaborated for probabilistic PCA, without orthogonality constraints by Bishop (1999b) and with orthogonality constraints by Šmídl and Quinn (2007).
- The choice of the number of PCs or factors is an important subproblem in PCA or Factor Analysis. The variational approach includes options to decide on the number of factors (Bishop, 1999b; Bishop, 2006, ch. 10.1.4). A more sophisticated approach with a prior on the number of factors requires inference based on sampling methods (Lopes and West, 2004, Zhang et al., 2004).
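The variational treatment of probabilistic PCA replaces point estimates by factorized approximate posterior distributions, but its updates alternate in the same way as the EM algorithm for the model. As an illustration of that alternating structure only (a sketch of standard EM for probabilistic PCA, not the variational algorithm of Bishop (1999b), which additionally places priors on the loadings):

```python
import numpy as np

def ppca_em_step(Xc, W, sigma2):
    """One EM iteration for probabilistic PCA. Xc: centered data (n, d)."""
    n, d = Xc.shape
    q = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(q)             # posterior precision of z given x
    Minv = np.linalg.inv(M)
    Ez = Xc @ W @ Minv                           # E-step: E[z_i | x_i], row-wise
    Ezz = n * sigma2 * Minv + Ez.T @ Ez          # sum_i E[z_i z_i^T]
    W_new = Xc.T @ Ez @ np.linalg.inv(Ezz)       # M-step: update loadings
    sigma2_new = (np.sum(Xc**2)                  # M-step: update noise variance
                  - 2.0 * np.sum(Ez * (Xc @ W_new))
                  + np.trace(Ezz @ W_new.T @ W_new)) / (n * d)
    return W_new, sigma2_new
```

In the variational version the same cycle is run over factorized posterior distributions, and the attained lower bound on the marginal likelihood serves as a model choice criterion for the number of factors.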
Most often, because of their flexibility, free knot B-splines are chosen as common basis functions. The knots then act as smoothing parameters whose estimation in a Bayesian approach requires sampling techniques. Hence the speed and ease of a closed-form implementation of the variational algorithm are lost. In order to reduce the number of smoothing parameters and to retain the applicability of a variational algorithm, including its device for a quick assessment of models with various numbers of eigenfunctions, a different spline basis is suggested. All functions are represented as interpolation splines on a fine grid of points (i.e. many, but known knots), and the Demmler–Reinsch basis is chosen to build up interpolation splines. This is most parsimonious, as the Demmler–Reinsch basis itself results from a PCA for interpolation splines (van der Linde, 2003). The smoother such a basis function is, the more important it is in reconstructing an interpolation spline. Therefore, this basis is very effective in filtering noise, and the degree of smoothness of the eigenfunctions can be controlled by just one smoothness parameter, the number of Demmler–Reinsch basis functions. Enforcing smoothness technically conflicts with orthogonality constraints imposed by a prior. Therefore, the selection of a rotation is not incorporated into the derivation of the (approximate) posterior distribution, but based on that distribution.
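The role of a Demmler–Reinsch-type basis can be illustrated on a fine regular grid. The sketch below uses a second-difference roughness penalty as a simple stand-in for the thin plate spline penalty of the paper (an assumption of ours): the eigen-decomposition of the penalty orders basis functions from smooth to rough, so truncating the basis after K functions controls smoothness through that single number.

```python
import numpy as np

def demmler_reinsch_basis(m):
    """Demmler-Reinsch-type basis on a regular grid of m points,
    built from a second-difference roughness penalty (illustrative choice)."""
    D = np.diff(np.eye(m), n=2, axis=0)   # second-difference operator, (m-2, m)
    Omega = D.T @ D                       # discretized roughness penalty matrix
    eta, Phi = np.linalg.eigh(Omega)      # eigenvalues ascending: smooth -> rough
    return eta, Phi                       # columns of Phi: orthonormal basis functions

# Keeping only the first K columns of Phi retains the smoothest basis
# functions and discards the rough ones that mainly pick up noise.
```

The first two eigenvalues are (numerically) zero, reflecting that constant and linear functions are unpenalized by the second-difference penalty.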
In summary, the proposed Bayesian FPCA essentially combines a generative model like that considered by James et al. (2000) with a variational algorithm like the one outlined by Bishop (1999b). To achieve this, a special basis of smooth functions is selected. The corresponding variational algorithm is partially newly derived, and finally a rotation of the resulting eigenfunctions is suggested. The approach is applied to simulated data to illustrate its performance and to real data for comparison to previous analyses in the literature emphasizing the estimation of the (smooth) mean function and the error assessment for reconstructed curves.
The paper is organized as follows: In Section 2 a motivating example with simulated curves is described, illustrating the goals and problems in FPCA. The model for FPCA is then specified, and the variational algorithm for inference is outlined in Section 3. In Section 4 the motivating example is resumed, and another example with a real data set is investigated. A concluding discussion is given in Section 5.
Section snippets
Description of simulated data and preliminary analyses
Consider the mixture of two Gaussian densities. To obtain a single curve a sample of 30 values was generated from this mixture distribution, and a kernel estimate was formed with a standard Gaussian kernel and a data-dependent bandwidth based on the empirical standard deviation in the sample. Although this bandwidth is asymptotically optimal, the kernel estimate tends to undersmooth the first peak of the density at zero.
100 curves were
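The construction of such simulated curves can be sketched as follows. The mixture weights, component parameters, grid and bandwidth rule below are our own illustrative choices (the paper's exact specification and bandwidth formula are not reproduced on this page); the bandwidth is taken from the normal reference rule:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_mixture(n):
    """Draw n values from an illustrative two-component Gaussian mixture."""
    comp = rng.random(n) < 0.5
    return np.where(comp, rng.normal(0.0, 0.5, n), rng.normal(2.0, 1.0, n))

def kde(x_grid, sample, h):
    """Gaussian kernel density estimate evaluated on x_grid."""
    u = (x_grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

sample = sample_mixture(30)
h = 1.06 * sample.std(ddof=1) * 30 ** (-1 / 5)   # normal reference rule (our assumption)
grid = np.linspace(-4.0, 7.0, 200)
curve = kde(grid, sample, h)                     # one simulated "curve"
```

Repeating this for 100 independent samples yields a set of curves (density estimates) to which FPCA can be applied.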
Model(s) and variational algorithm
Consider real valued functions on an interval, and assume each to be a thin plate spline of order 2 (see Wahba (1990, ch. 2.4)). To discretize, a grid of points is chosen, yielding vectors of function values. Each function is approximated by the interpolation spline, the smoothest function (in the sense of minimizing the roughness penalty) among all functions attaining the given values on the grid. Thus the discretization is equivalent to restricting ourselves to a function space, say, of
Complete curves with noise
The curves are now given as kernel estimates (Eq. (3)) plus Gaussian noise as described in Section 2. The dimensionality of the basis representing the mean function and of the basis representing the eigenfunctions was determined by maximizing the lower bound for each, both within a range of 5–15. The maximization of the lower bound was based on 15 iterations (with one iteration comprising an update of all parameters), which correspond to a threshold of 0.0005
Smooth mean functions
A difference between the (frequentist) representative and the (Bayesian) generative approach lies in the treatment of the mean function. If the focus is on estimating the covariance, the mean function is regarded as nuisance parameter and estimated separately. In the generative approach the mean function and the eigenfunctions as smooth functions are estimated simultaneously (respectively alternatingly in the variational algorithm), and a crucial issue for the performance of the model is
Acknowledgements
The author would like to thank two anonymous referees for valuable hints and comments which helped to considerably improve an earlier draft of this paper.
References (72)
- Forecasting binary longitudinal data by a functional PC-ARIMA model. Comput. Statist. Data Anal. (2008)
- Using principal components for estimating logistic regression with high-dimensional multicollinear data. Comput. Statist. Data Anal. (2006)
- Simultaneous nonparametric regressions of unbalanced longitudinal data. Comput. Statist. Data Anal. (1997)
- Kernel-based functional principal components. Statist. Probab. Lett. (2000)
- Modelling the mean of a doubly stochastic Poisson process by functional data analysis. Comput. Statist. Data Anal. (2006)
- Diagnostics for functional regression via residual processes. Comput. Statist. Data Anal. (2007)
- Asymptotic theory for the principal component analysis of a random vector function: Some application to statistical inference. J. Multivariate Anal. (1982)
- Functional PLS logit regression model. Comput. Statist. Data Anal. (2007)
- Robust forecasting of mortality and fertility rates: A functional data approach. Comput. Statist. Data Anal. (2007)
- Principal component analysis of measures, with emphasis on grain size curves. Comput. Statist. Data Anal. (2007)
- Statistics for functional data. Comput. Statist. Data Anal.
- Classifying densities using functional regression trees. Applications in oceanology. Comput. Statist. Data Anal.
- Functional principal component analysis by choice of norm. J. Multivariate Anal.
- On Bayesian principal component analysis. Comput. Statist. Data Anal.
- Variational inference for Student-t models: Robust Bayesian interpolation and generalized component analysis. Neurocomputing
- Stochastic modelling for evolution of stock prices by means of functional principal component analysis. Appl. Stoch. Models Bus. Ind.
- Forecasting time series by functional PCA. Discussion of several weighted approaches. Comput. Statist.
- Forecasting with unequally spaced data by a functional principal component approach. Test
- An approximated principal component prediction model for continuous-time stochastic processes. Appl. Stoch. Models Data Anal.
- Independent factor analysis. Neural Comput.
- Hierarchical models for assessing variability among functions. Biometrika
- Spline functions and optimal metric in linear principal component analysis
- Bayesian PCA
- Variational principal components
- Pattern Recognition and Machine Learning
- Functional approach to the random mean of a compound Cox process. Comput. Statist.
- Nonparametric estimation of smoothed principal components analysis of sampled noisy functions. Nonparametric Statist.
- Spline estimators for the functional linear model. Statist. Sinica
- Emerging issues in longitudinal and functional data analysis. Introduction. Statist. Sinica
- Modeling environmental data by functional principal component logistic regression. Environmetrics
- Principal component estimation of functional logistic regression: Discussion of two different approaches. Nonparametric Statist.
- Functional estimation incorporating prior correlation information. Comput. Statist.
- Nonparametric Functional Data Analysis