Functional factorial K-means analysis

https://doi.org/10.1016/j.csda.2014.05.010

Abstract

A new procedure is presented for simultaneously finding the optimal cluster structure of multivariate functional objects and the subspace that represents this cluster structure. The method is based on the k-means criterion applied to functional objects projected onto a subspace in which a cluster structure exists. An efficient alternating least-squares algorithm is described, and the proposed method is extended to a regularized version that ensures smoothness of the weight functions. To deal with the negative effect of correlation in the coefficient matrix of the basis function expansion within the proposed algorithm, a two-step variant of the method is also described. Analyses of artificial and real data demonstrate that the proposed method gives correct and interpretable results compared with existing methods, namely the functional principal component k-means (FPCK) method and the tandem clustering approach. It is also shown that the proposed method can be considered complementary to FPCK.

Introduction

In the last few decades, thanks to technical advances in storing and processing data, large amounts of data have become readily available. A particular case of such data is that of variables taking values in an infinite-dimensional space, typically a space of functions defined on some set T. Such data are represented by curves or functions and are thus called functional data. It has become increasingly easy to observe functional data in medicine, economics, psychometrics, and many other domains (see Ramsay and Silverman, 2005, for an overview).

In the framework of functional data analysis, many clustering methods have already been proposed in the literature. A common way to proceed is to filter first, that is, to approximate each function by a linear combination of a small number of basis functions, and then to apply a classical clustering method to the resulting basis coefficients. For example, Abraham et al. (2003) and Serban and Wasserman (2005) adopt the filtering approach. Another approach is distance-based, in which clustering algorithms built on distances specific to functional data are used. Tarpey and Kinateder (2003) investigate the k-means algorithm with the usual L2-metric distance for Gaussian processes, and prove that the cluster centers are linear combinations of functional principal component analysis (FPCA) eigenfunctions. In addition, Ferraty and Vieu (2006) propose a hierarchical clustering algorithm combined with the L2-metric distance or with semi-metric distances. Recent developments of clustering methods for functional data are thoroughly reviewed in Jacques and Preda (in press).
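To make the filtering approach concrete, the sketch below expands each observed curve on a common cubic B-spline basis and clusters the resulting coefficient vectors with k-means. It is a minimal illustration, not the implementation used in any of the works cited above; the basis size, the uniform knot placement, and the use of scikit-learn's KMeans are assumptions made here for brevity.

```python
import numpy as np
from scipy.interpolate import make_lsq_spline
from sklearn.cluster import KMeans


def filter_then_cluster(curves, t, n_clusters, n_basis=10, degree=3):
    """Filtering approach: least-squares B-spline expansion of each curve,
    followed by k-means on the coefficient vectors.

    curves : (N, T) array of curve values observed on the common grid t.
    Returns cluster labels and the (N, n_basis) coefficient matrix.
    """
    # Full knot vector: boundary knots repeated (degree + 1) times plus
    # uniformly spaced interior knots.
    n_interior = n_basis - degree - 1
    interior = np.linspace(t[0], t[-1], n_interior + 2)[1:-1]
    knots = np.concatenate(([t[0]] * (degree + 1), interior, [t[-1]] * (degree + 1)))

    # The spline coefficients are the finite-dimensional representation
    # on which the classical clustering method operates.
    coefs = np.vstack([make_lsq_spline(t, y, knots, k=degree).c for y in curves])

    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(coefs)
    return labels, coefs
```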

As described in Jacques and Preda (in press), other clustering methods for functional data have recently been developed in which the optimal cluster structure of functions and optimal subspaces for clustering are identified simultaneously. The use of a low-dimensional representation of functions can help to provide simpler and more interpretable solutions. Indeed, cluster analysis of functional objects is often carried out in combination with dimension reduction (e.g., Illian et al., 2009, Suyundykov et al., 2010). Bouveyron and Jacques (2011) developed a model-based clustering method for functional data that finds cluster-specific functional subspaces. Yamamoto (2012) proposed a method, called functional principal component k-means (FPCK) analysis, which attempts to find an optimal common subspace for the clustering of multivariate functional data. The method aims to overcome the problem of tandem clustering (Arabie and Hubert, 1994) for functional data, in which a dimension-reduction technique such as FPCA (e.g., Ramsay and Silverman, 2005, Besse and Ramsay, 1986, Boente and Fraiman, 2000) is applied first, and an ordinary clustering algorithm is subsequently applied to the principal component scores. Note that Gattone and Rocci (2012) have also developed a subspace clustering procedure that is essentially equivalent to FPCK, though their method deals with univariate functional data.
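A minimal sketch of the tandem approach mentioned above, reusing the coefficient matrix produced by the previous snippet: principal component analysis of the basis coefficients stands in for FPCA (the two coincide only when the basis is orthonormal; the Gram-matrix correction is omitted here), and k-means is then applied to the scores on the first few components. This illustrates the general two-step scheme, not the exact procedure criticized or evaluated in the paper.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def tandem_cluster(coefs, n_clusters, n_components):
    """Tandem analysis sketch: dimension reduction first, clustering second.

    coefs : (N, K) basis-coefficient matrix, e.g. from filter_then_cluster.
    """
    # Step 1: PCA of the basis coefficients as a stand-in for FPCA.
    scores = PCA(n_components=n_components).fit_transform(coefs)
    # Step 2: ordinary k-means on the retained component scores.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(scores)
    return labels, scores
```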

The methods of Bouveyron and Jacques (2011) and Yamamoto (2012) can be classified as subspace clustering techniques (Timmerman et al., 2010, Vidal, 2011) for functional data. As with subspace clustering techniques for multivariate matrix data, there are two types of methods for functional data: one seeks a subspace specific to each cluster (Bouveyron and Jacques, 2011), and the other seeks a subspace common to all clusters (Yamamoto, 2012). Here, we focus on common subspace clustering.

Yamamoto (2012) shows that in various cases the FPCK method can find both an optimal cluster structure and the subspace for the clustering. The FPCK method, however, has a drawback caused by the definition of its loss function: if no substantial correlation is present in the part of the functions that is informative about the cluster structure, FPCK fails to recover the cluster structure and the corresponding subspace. This drawback is explained in more detail in the next section. In this paper, to overcome this drawback, we present a new method that simultaneously finds the cluster structure and reduces the dimension of multivariate functional objects. It will be shown that the proposed method has a mutually complementary relationship with the FPCK method.

This paper is organized as follows. Section 2 defines the notation used in this paper and discusses the drawbacks of FPCK analysis. In Section 3, a new clustering and dimension reduction method for functional objects is described, and an algorithm to implement the method is proposed. In Section 4, the performance of the proposed method is studied using artificial data, and an illustrative application to real data is presented in Section 5. Finally, in Section 6, we conclude the paper with a discussion and make recommendations for future research.

Section snippets

Notation

First we present the notation that will be used throughout this paper. For ease of explanation, the same notation as in Yamamoto (2012) is used. Suppose that the nth functional object (n = 1, ..., N) with P variables is represented as x_n(t) = (x_np(t) | p = 1, ..., P), with domain T ⊂ R^d. For simplicity, we write x_n = (x_n(t) | t ∈ T) to denote the nth observed function. In the rest of the paper, for a general understanding of the problem, we consider the single-variable case, i.e., P = 1; in this case, the suffix p in the notation is omitted.
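To connect this notation with the "coefficient matrix of the basis function expansion" mentioned in the abstract, the standard basis-expansion representation is recorded below in generic symbols (the symbols c_nk and φ_k are chosen here and are not necessarily the paper's notation):

```latex
x_n(t) \;\approx\; \sum_{k=1}^{K} c_{nk}\,\phi_k(t), \qquad t \in \mathcal{T},
\qquad \mathbf{c}_n = (c_{n1}, \dots, c_{nK})^{\top},
```

so that the N × K matrix C = (c_nk) collects the expansion coefficients of all observed functions.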

Criterion of the functional factorial k-means method

To overcome the drawback of FPCK analysis discussed above, we propose a new clustering method with dimension reduction. The notation and settings were explained in Section 2. For ease of explanation, we first consider the case in which there is only one variable, i.e., P = 1. Thus, in this section, the suffix p is omitted from the notation. An extension to the multivariate model is straightforward and is described in Appendix A.

A least-squares objective function for the proposed approach, in
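The snippet breaks off before the objective function itself. According to the abstract, the criterion is a k-means criterion for functional objects projected onto a low-dimensional subspace, minimized by an alternating least-squares (ALS) algorithm. The sketch below illustrates an ALS scheme of that general flavor on a finite-dimensional representation (for instance, the basis-coefficient matrix from the earlier snippets): it alternates between updating an orthonormal projection for a fixed partition and reassigning clusters in the projected space. The update rules shown are those of the classical matrix-data version of this criterion, and the function name and defaults are choices made here; this is not the authors' exact FFKM algorithm.

```python
import numpy as np


def projected_kmeans_als(X, n_clusters, n_dims, n_iter=50, seed=0):
    """ALS sketch of a 'k-means on projected objects' criterion: minimise
    ||X A - U F||^2 over an orthonormal projection A (K x n_dims), cluster
    memberships U and reduced-space centroids F.

    X : (N, K) matrix, e.g. basis coefficients of N functional objects.
    Illustrative only; not the FFKM updates of the paper.
    """
    rng = np.random.default_rng(seed)
    N, K = X.shape
    labels = rng.integers(n_clusters, size=N)            # random initial partition

    for _ in range(n_iter):
        # Projection step: with the partition fixed, the optimal A spans the
        # eigenvectors of X'(I - P_U)X associated with the smallest eigenvalues,
        # i.e. the directions with least within-cluster scatter.
        U = np.eye(n_clusters)[labels]                    # (N, G) indicator matrix
        P_U = U @ np.linalg.pinv(U)                       # projector onto cluster means
        S = X.T @ (np.eye(N) - P_U) @ X
        _, eigvec = np.linalg.eigh(S)                     # eigenvalues in ascending order
        A = eigvec[:, :n_dims]

        # Assignment step: ordinary k-means reassignment in the projected space.
        Z = X @ A
        centroids = np.vstack([
            Z[labels == g].mean(axis=0) if np.any(labels == g) else Z[rng.integers(N)]
            for g in range(n_clusters)
        ])
        new_labels = np.argmin(((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

    return labels, A
```

Because the alternating scheme only guarantees a local minimum, in practice one would restart it from several random partitions and keep the solution with the smallest value of the loss ||XA − UF||².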

Data and evaluation procedures

To investigate the performance of the FFKM method, artificial data that included a known low-dimensional cluster structure were analyzed by four different methods: (i) the FFKM method, (ii) the two-step FFKM method (FFKMts), (iii) the FPCK method, and (iv) tandem analysis (TA), consisting of FPCA based on a basis function expansion (Ramsay and Silverman, 2005) followed by a standard k-means cluster analysis of the scores on the first L principal components. Note that the

Empirical example

In this section, we present an empirical analysis to demonstrate the use of the FFKM method and to compare its performance with that of the existing methods, FPCK and tandem analysis (TA). We used the well-known phoneme dataset from a speech-recognition problem, as described by Hastie et al. (1995). The data are log-periodograms computed from speech frames of 32 ms duration, corresponding to five phonemes: "sh" as in "she", "dcl" as in "dark", "iy" as the vowel in "she", "aa" as the vowel in "dark", and
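Since the snippet is cut off, the sketch below only illustrates how data of this shape (N log-periodogram curves observed on a common frequency grid, each labelled with one of the five phonemes) could be fed into the earlier snippets. The file name, column layout, and parameter values are hypothetical and are not the actual distribution or settings used by the authors.

```python
import numpy as np

# Hypothetical layout: a CSV with one row per speech frame, all but the last
# column holding log-periodogram values on a common frequency grid, and the
# last column holding the phoneme label.
raw = np.genfromtxt("phoneme.csv", delimiter=",", dtype=str, skip_header=1)
curves = raw[:, :-1].astype(float)                    # (N, n_freq) log-periodogram curves
phonemes = raw[:, -1]                                 # true phoneme labels, for evaluation only
freq_grid = np.arange(curves.shape[1], dtype=float)   # common frequency grid (arbitrary units)

# Reuse the earlier sketches: basis-expansion filtering, then clustering in a
# low-dimensional subspace, with the number of clusters set to the number of phonemes.
filter_labels, coefs = filter_then_cluster(curves, freq_grid, n_clusters=5)
subspace_labels, A = projected_kmeans_als(coefs, n_clusters=5, n_dims=2)
```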

Discussion

In this article, we explained the drawbacks of the FPCK method and proposed a new method, FFKM analysis, to overcome them. The FFKM method aims to simultaneously classify functional objects into optimal clusters and find a subspace that best describes both the classification and the dimension reduction of the data. An ALS algorithm was proposed to efficiently solve the minimization problem of the least-squares objective function. Analysis of artificial data reveals that the FFKM method can give

Acknowledgments

We thank the Associate Editor and two anonymous reviewers for their constructive comments that helped to improve the quality of this article. This work was supported by JSPS Grant-in-Aid for JSPS Fellows Number 24-2676.

References

  • C. Bouveyron, J. Jacques. Model-based clustering of time series in group-specific functional subspaces. Adv. Data Anal. Classif. (2011)
  • N. Dunford, J.T. Schwartz. Linear Operators, Part 2: Spectral Theory, Self Adjoint Operators in Hilbert Space (1988)
  • F. Ferraty, P. Vieu. Nonparametric Functional Data Analysis (2006)
  • S.A. Gattone, R. Rocci. Clustering curves on a reduced subspace. J. Comput. Graph. Statist. (2012)
  • P.J. Green, B.W. Silverman. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach (1994)
  • T. Hastie, A. Buja, R. Tibshirani. Penalized discriminant analysis. Ann. Statist. (1995)