Functional k-means inverse regression

https://doi.org/10.1016/j.csda.2013.09.004

Abstract

A new dimension reduction method is proposed for functional multivariate regression with a multivariate response and a functional predictor by extending the functional sliced inverse regression model. Naive application of existing dimension reduction techniques for univariate response will create too many hyper-rectangular slices. To avoid this curse of dimensionality, a new slicing method is proposed by clustering over the space of the multivariate response, which generates a much smaller set of slices of flexible shapes. The proposed method can be applied to any number of response variables and can be particularly useful for exploratory analysis. In addition, a new eigenvalue-based method for determining the dimensionality of the reduced space is developed. Real and simulation data examples are then presented to demonstrate the effectiveness of the proposed method.

Introduction

Functional data analysis (FDA) is increasingly popular for analyzing data that are believed to be sampled from underlying smooth functions (see Ramsay and Silverman, 2002; Ramsay and Silverman, 2005; Ferraty and Vieu, 2006; Horváth and Kokoszka, 2012). In the past few decades, many classic multivariate statistical methods have been extended to functional data, such as functional principal component analysis (FPCA; Silverman, 1996; Yao et al., 2005; Paul and Peng, 2009), functional linear regression (Ramsay and Dalzell, 1991; Cardot et al., 2003; Goldsmith et al., 2011), nonparametric functional regression (Ferraty and Vieu, 2002; Ferraty et al., 2011; Ferraty et al., 2012), and semi-parametric functional regression (Dauxois et al., 2001; Ferré and Yao, 2003; Aneiros and Vieu, 2008; Chen et al., 2011; Ferraty et al., 2013). Because functional data are infinite-dimensional, dimension reduction is often necessary. FPCA is widely used in the non-regression context. For regression problems with functional predictors, a number of dimension reduction methods have been proposed for the case of a univariate response (Amato et al., 2006; Dauxois et al., 2001; Ferré and Villa, 2006; Ferré and Yao, 2003; Ferré and Yao, 2005; Hsing and Ren, 2009). These methods were motivated by extending the idea of sliced inverse regression (SIR; Li, 1991) to the functional case. For example, Ferré and Yao (2003) considered the following model

$$y = f\left(\int_a^b \beta_1(t)\,x(t)\,dt, \ldots, \int_a^b \beta_K(t)\,x(t)\,dt, \varepsilon\right), \tag{1}$$

where $y$ is a real scalar variable, $\varepsilon$ is a real random variable with mean zero and constant variance that is independent of $x$, and $x$ is a random curve in $L^2([a,b])$, the space of square integrable functions from $[a,b]$ into $\mathbb{R}$ with the usual inner product $\langle f, g \rangle = \int_a^b f(t)\,g(t)\,dt$. The link function $f: \mathbb{R}^{K+1} \to \mathbb{R}$ is assumed nonparametric.
Based on model (1), a dimension reduction space, called the effective dimension reduction (EDR) space, is then spanned by the $K$ linearly independent functions $\beta_1(t), \ldots, \beta_K(t) \in L^2([a,b])$, which themselves are called the EDR directions.
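As a rough numerical illustration of model (1), the sketch below approximates the projections of a curve onto two directions with the trapezoidal rule and feeds them into a link function. The interval [0, 1], the sine basis used to generate curves, the two directions `beta1` and `beta2`, and the particular link are all hypothetical choices made for this example; none of them come from the paper.

```python
import numpy as np

def inner_product(f, g, t):
    """Trapezoidal approximation of the integral of f(t) * g(t) over the grid t."""
    h = f * g
    return np.sum((h[..., 1:] + h[..., :-1]) * np.diff(t) / 2.0, axis=-1)

t = np.linspace(0.0, 1.0, 201)            # assumed interval [a, b] = [0, 1]
rng = np.random.default_rng(1)

# Hypothetical smooth random curves: random combinations of five sine functions.
basis = np.array([np.sin((j + 1) * np.pi * t) for j in range(5)])
X = rng.normal(size=(100, 5)) @ basis     # shape (100, 201), one curve per row

# Two illustrative directions (K = 2), orthonormal in L2([0, 1]).
beta1 = np.sqrt(2.0) * np.sin(np.pi * t)
beta2 = np.sqrt(2.0) * np.sin(2.0 * np.pi * t)

v1 = inner_product(X, beta1, t)           # projection of each curve on beta1
v2 = inner_product(X, beta2, t)           # projection of each curve on beta2
eps = 0.1 * rng.normal(size=100)          # noise independent of the curves
y = v1 + np.exp(v2) + eps                 # one arbitrary instance of the link f
```

In this toy setting the response depends on each curve only through `v1` and `v2`, which is the structure that the EDR space is meant to capture.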

In this article, we consider dimension reduction in functional regression with a multivariate response. Although the aforementioned SIR-based methods can be applied directly by slicing each response marginally, the number of multivariate slices increases exponentially with the dimension of the response. To overcome this curse of dimensionality, and partly motivated by results in Setodji and Cook (2004), we propose functional k-means inverse regression (FKIR), which replaces simple slicing with clustering over the space of the multivariate response. FKIR provides a more effective way to condition on the response, and our results show that it performs well whether the link function is linear or nonlinear. Furthermore, we propose a simple and effective maximum eigenvalue ratio criterion (MERC) to determine the dimensionality of the EDR space.

To illustrate the effectiveness of our method, we consider the cookie data used in Amato et al. (2006), which contain measurements on the composition of 70 biscuit dough pieces. Each dough piece is a mixture of fat, sucrose, dry flour, and water. The aim of the analysis is to predict the content of these ingredients (response variables) from the spectrum of the mixture. The original spectral data are composed of 700 channels (each channel is viewed as one observation) in the spectral range of 1100–2498 nm in steps of 2 nm. With a sample size of only 70, dimension reduction on the functional predictor is necessary in this case. A simple marginal slicing with just three slices on each of the four response variables will result in $3^4 = 81$ multivariate slices overall, which is infeasible because it exceeds the sample size. An application of FKIR divided the 70 response vectors into five clusters (slices) of 13, 19, 8, 18 and 12 cases, and identified an EDR space spanned by four linear combinations of the predictors. Fig. 1 plots each response variable against $v_i = \langle x, \beta_i \rangle$ ($i = 1, 2, 3, 4$, where $\beta_i$ is the $i$th EDR direction), and can provide important insights for choosing a proper parametric functional regression model.
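The slice-count arithmetic above can be checked with a short sketch. It cuts each response at its empirical quantiles and counts how many hyper-rectangular cells exist and how many actually contain data. The data here are synthetic Gaussian draws standing in for a 70-by-4 response matrix, not the cookie data, and the helper name `count_marginal_slices` is made up for this example.

```python
import numpy as np

def count_marginal_slices(Y, n_slices):
    """Return (total cells, occupied cells) for marginal quantile slicing of Y."""
    n, p = Y.shape
    labels = np.empty((n, p), dtype=int)
    for j in range(p):
        # Interior quantile cut points for response j.
        cuts = np.quantile(Y[:, j], np.linspace(0.0, 1.0, n_slices + 1)[1:-1])
        # Slice index in 0 .. n_slices - 1 for every observation.
        labels[:, j] = np.searchsorted(cuts, Y[:, j], side="right")
    occupied = np.unique(labels, axis=0).shape[0]
    return n_slices ** p, occupied

rng = np.random.default_rng(0)
Y = rng.normal(size=(70, 4))      # synthetic stand-in: 70 cases, 4 responses
total, occupied = count_marginal_slices(Y, 3)
```

With 70 observations at most 70 of the 81 cells can be occupied, so some cells are necessarily empty and many of the rest hold only one or two cases.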

This paper is organized as follows. In Section 2, we briefly review functional sliced inverse regression (FSIR; Ferré and Yao, 2003) to aid understanding of the FKIR method. We then describe the FKIR method in Section 3. Section 4 presents simulation studies and real data examples to illustrate the merits of FKIR.


Functional sliced inverse regression

The FSIR model is given in (1) as an extension of SIR (Li, 1991) to the case where the predictor $x$ is functional. The key idea of FSIR is to determine the EDR directions from the covariance operator $\Gamma_e = E\big(E(x \mid y) \otimes E(x \mid y)\big)$, where $\otimes$ denotes the tensor product in $L^2([a,b])$, meaning that for $f_1, f_2 \in L^2([a,b])$, $(f_1 \otimes f_2)(\upsilon) = \langle f_1, \upsilon \rangle f_2$ for any $\upsilon \in L^2([a,b])$. An estimate of $\Gamma_e$ is constructed by slicing over the response variable $y$.
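A minimal sketch of such a slice-based estimate is given below for curves observed on a common grid. It assumes equal-count slices of the sorted response and represents the operator by its kernel matrix on the grid; both are simplifying assumptions of this example. It does not include any regularization or the later steps of FSIR, and the function name `fsir_gamma_e` and the synthetic data are hypothetical.

```python
import numpy as np

def fsir_gamma_e(X, y, n_slices):
    """Slice-based estimate of the kernel of Gamma_e on the observation grid.

    X: (n, T) array of curves sampled on a common grid.
    y: (n,) array of scalar responses.
    """
    n, T = X.shape
    Xc = X - X.mean(axis=0)                    # center the curves
    order = np.argsort(y)                      # sort cases by response
    slices = np.array_split(order, n_slices)   # roughly equal-count slices
    G = np.zeros((T, T))
    for idx in slices:
        p_h = len(idx) / n                     # proportion of cases in the slice
        m_h = Xc[idx].mean(axis=0)             # slice mean, estimates E(x | y in slice)
        G += p_h * np.outer(m_h, m_h)          # kernel of p_h * (m_h tensor m_h)
    return G

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
basis = np.array([np.sin(np.pi * t), np.cos(np.pi * t), t])
X = rng.normal(size=(200, 3)) @ basis          # synthetic curves
y = X @ np.sin(np.pi * t) / len(t) + 0.1 * rng.normal(size=200)
G = fsir_gamma_e(X, y, n_slices=5)
eigvals = np.linalg.eigvalsh(G)[::-1]          # eigenvalues, largest first
```

Because the estimate is a sum of one outer product per slice, its rank cannot exceed the number of slices, which is one reason the choice of slices matters.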

A major challenge in the functional case is that the Hilbert–Schmidt operator $\Gamma = E(x \otimes x)$ is not invertible (

FKIR

One may naively apply FSIR to multivariate responses in $\mathbb{R}^p$ by slicing each response variable into $H$ slices. This simple marginal slicing leads to $H^p$ multivariate slices, a number that increases exponentially with the response dimension $p$. Because of this curse of dimensionality, most slices will either be empty or contain too few observations. Therefore, a more careful slicing scheme becomes necessary. Our idea is to obtain meaningful slices by clustering the observed response values, which effectively reduces the number of slices
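The clustering idea can be sketched as follows: group the response vectors with k-means, then treat each cluster as one slice when averaging the centered curves. This is only an illustration under assumptions. It uses a plain NumPy implementation of Lloyd's algorithm with random initial centers, applies no scaling to the responses, and stops at the weighted sum of outer products of cluster means. The excerpt does not show the paper's full estimator, so the function names and these choices are hypothetical.

```python
import numpy as np

def kmeans_labels(Y, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns a cluster label for each row of Y."""
    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(len(Y), size=k, replace=False)].astype(float)
    labels = np.full(len(Y), -1)
    for _ in range(n_iter):
        # Squared distances from every point to every center, shape (n, k).
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                              # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = Y[labels == c]
            if len(members) > 0:               # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    return labels

def cluster_sliced_kernel(X, Y, k):
    """Weighted outer products of cluster means of the centered curves."""
    Xc = X - X.mean(axis=0)
    labels = kmeans_labels(Y, k)
    G = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(labels):
        idx = labels == c
        m_c = Xc[idx].mean(axis=0)             # mean curve within cluster c
        G += idx.mean() * np.outer(m_c, m_c)   # weight by cluster proportion
    return G, labels

rng = np.random.default_rng(0)
X = rng.normal(size=(70, 50))                  # synthetic curves on a 50-point grid
Y = rng.normal(size=(70, 4))                   # synthetic 4-dimensional responses
G, labels = cluster_sliced_kernel(X, Y, k=5)
sizes = np.bincount(labels, minlength=5)       # number of cases per cluster
```

With five clusters there are at most five slices regardless of the response dimension, in contrast to the 81 cells produced by three marginal slices on four responses.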

Simulation study

In this subsection, we carry out simulations to investigate the performance of FKIR and compare it with two other methods, which we call PCA and MDCCA. The PCA method refers to an extension of Aragon (1997) that performs marginal slicing over the first principal component of the multivariate responses. The MDCCA method of Wang et al. (2012) is not a SIR-type method and does not involve slicing. It estimates the EDR space based on a mixed data canonical correlation analysis between the multivariate

Conclusion and discussion

To the best of our knowledge, our proposed FKIR method is the first formal attempt at dimension reduction in functional multivariate regression, though some existing methods, such as the PCA method (Aragon, 1997) and MDCCA (Wang et al., 2012), may be exploited for such a purpose. We use k-means clustering to identify flexible slices (clusters) over the multivariate response space to avoid the curse of dimensionality caused by simple marginal slicing. Through simulation studies and real

Acknowledgments

We sincerely thank the referees and the editor for their insights and comments that allowed us to significantly improve our paper. This research was supported by Program for the National Science Foundation of China (No. 11271064), the Ph.D. Programs Foundation of Ministry of Education of China (20100043110002) and Fund of Jilin Provincial Science & Technology Department (No. 20111804).

References (43)

  • T. Cai et al. Prediction in functional linear regression. The Annals of Statistics (2006)
  • H. Cardot et al. Spline estimators for the functional linear model. Statistica Sinica (2003)
  • D. Chen et al. Single and multiple index functional regression models with nonparametric link. The Annals of Statistics (2011)
  • J.M. Chiou et al. Functional clustering and identifying substructures of longitudinal data. Journal of the Royal Statistical Society Series B (2007)
  • R.D. Cook et al. Likelihood-based sufficient dimension reduction. Journal of the American Statistical Association (2009)
  • R.D. Cook et al. Envelope models for parsimonious and efficient multivariate linear regression. Statistica Sinica (2010)
  • F. Ferraty et al. Functional projection pursuit regression. TEST (2013)
  • F. Ferraty et al. Kernel regression with functional response. Electronic Journal of Statistics (2011)
  • F. Ferraty et al. The functional nonparametric model and application to spectrometric data. Computational Statistics (2002)
  • F. Ferraty et al. Nonparametric Functional Data Analysis: Theory and Practice (2006)
  • F. Ferraty et al. Richesse et complexité des données fonctionnelles. Revue de Modulad (2011)