Functional factorial K-means analysis

https://doi.org/10.1016/j.csda.2014.05.010

Abstract

A new procedure is presented for simultaneously finding the optimal cluster structure of multivariate functional objects and the subspace that represents this cluster structure. The method is based on the k-means criterion applied to functional objects projected onto a subspace in which a cluster structure exists. An efficient alternating least-squares algorithm is described, and the proposed method is extended to a regularized version that ensures smoothness of the weight functions. To deal with the negative effect of correlation in the coefficient matrix of the basis function expansion within the proposed algorithm, a two-step variant of the method is also described. Analyses of artificial and real data demonstrate that the proposed method gives correct and interpretable results compared with existing methods, namely the functional principal component k-means (FPCK) method and the tandem clustering approach. It is also shown that the proposed method can be considered complementary to FPCK.

Introduction

In the last few decades, thanks to technical advances in storing and processing data, large amounts of data have become readily available. A particular case of such data is that of variables taking values in an infinite-dimensional space, typically a space of functions defined on some set T. Such data are represented by curves or functions and are thus called functional data. It has become increasingly easy to observe functional data in medicine, economics, psychometrics, and many other domains (see Ramsay and Silverman, 2005, for an overview).

In the framework of functional data analysis, many clustering methods have already been proposed in the literature. A common way to proceed is to filter first, that is, to approximate each function by a linear combination of a small number of basis functions, and then to apply a classical clustering method to the resulting basis coefficients. For example, Abraham et al. (2003) and Serban and Wasserman (2005) adopt the filtering approach. Another approach is distance-based, in which clustering algorithms built on distances specific to functional data are used. Tarpey and Kinateder (2003) investigate the k-means algorithm with the usual L2-metric distance for Gaussian processes, and prove that the cluster centers are linear combinations of functional principal component analysis (FPCA) eigenfunctions. In addition, Ferraty and Vieu (2006) propose a hierarchical clustering algorithm combined with the L2-metric distance or with semi-metric distances. Recent developments of clustering methods for functional data are thoroughly reviewed in Jacques and Preda (in press).
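To make the filtering approach concrete, the sketch below expands each observed curve on a common cubic B-spline basis and clusters the resulting coefficient vectors with k-means. It is a minimal illustration, not the implementation used in any of the works cited above; the basis size, the uniform knot placement, and the use of scikit-learn's KMeans are assumptions made here for brevity.

```python
import numpy as np
from scipy.interpolate import make_lsq_spline
from sklearn.cluster import KMeans


def filter_then_cluster(curves, t, n_clusters, n_basis=10, degree=3):
    """Filtering approach: least-squares B-spline expansion of each curve,
    followed by k-means on the coefficient vectors.

    curves : (N, T) array of curve values observed on the common grid t.
    Returns cluster labels and the (N, n_basis) coefficient matrix.
    """
    # Full knot vector: boundary knots repeated (degree + 1) times plus
    # uniformly spaced interior knots.
    n_interior = n_basis - degree - 1
    interior = np.linspace(t[0], t[-1], n_interior + 2)[1:-1]
    knots = np.concatenate(([t[0]] * (degree + 1), interior, [t[-1]] * (degree + 1)))

    # The spline coefficients are the finite-dimensional representation
    # on which the classical clustering method operates.
    coefs = np.vstack([make_lsq_spline(t, y, knots, k=degree).c for y in curves])

    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(coefs)
    return labels, coefs
```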

As described in Jacques and Preda (in press), other clustering methods for functional data have recently been developed in which the optimal cluster structure of functions and optimal subspaces for clustering are identified simultaneously. The use of a low-dimensional representation of functions can help to provide simpler and more interpretable solutions. Indeed, cluster analysis of functional objects is often carried out in combination with dimension reduction (e.g., Illian et al., 2009, Suyundykov et al., 2010). Bouveyron and Jacques (2011) developed a model-based clustering method for functional data that finds cluster-specific functional subspaces. Yamamoto (2012) proposed a method, called functional principal component k-means (FPCK) analysis, which attempts to find an optimal common subspace for the clustering of multivariate functional data. The method aims to overcome the problem of tandem clustering (Arabie and Hubert, 1994) for functional data, in which a dimension-reduction technique such as FPCA (e.g., Ramsay and Silverman, 2005, Besse and Ramsay, 1986, Boente and Fraiman, 2000) is applied first, and an ordinary clustering algorithm is subsequently applied to the principal component scores. Note that Gattone and Rocci (2012) have also developed a subspace clustering procedure that is essentially equivalent to FPCK, though their method deals with univariate functional data.
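A minimal sketch of the tandem approach mentioned above, reusing the coefficient matrix produced by the previous snippet: principal component analysis of the basis coefficients stands in for FPCA (the two coincide only when the basis is orthonormal; the Gram-matrix correction is omitted here), and k-means is then applied to the scores on the first few components. This illustrates the general two-step scheme, not the exact procedure criticized or evaluated in the paper.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def tandem_cluster(coefs, n_clusters, n_components):
    """Tandem analysis sketch: dimension reduction first, clustering second.

    coefs : (N, K) basis-coefficient matrix, e.g. from filter_then_cluster.
    """
    # Step 1: PCA of the basis coefficients as a stand-in for FPCA.
    scores = PCA(n_components=n_components).fit_transform(coefs)
    # Step 2: ordinary k-means on the retained component scores.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(scores)
    return labels, scores
```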

The methods of Bouveyron and Jacques (2011) and Yamamoto (2012) can be classified as subspace clustering techniques (Timmerman et al., 2010, Vidal, 2011) for functional data. As with subspace clustering techniques for multivariate matrix data, there are two types of methods for functional data: one seeks a subspace specific to each cluster (Bouveyron and Jacques, 2011), and the other seeks a subspace common to all clusters (Yamamoto, 2012). Here, we focus on common subspace clustering.

Yamamoto (2012) shows that in various cases the FPCK method can find both an optimal cluster structure and the subspace for the clustering. The FPCK method, however, has a drawback caused by the definition of its loss function: if no substantial correlation is present in the part of the functions that is informative about the cluster structure, FPCK fails to recover the cluster structure and the corresponding subspace. This drawback is explained in more detail in the next section. In this paper, to overcome this drawback, we present a new method that simultaneously finds the cluster structure and reduces the dimension of multivariate functional objects. It will be shown that the proposed method has a mutually complementary relationship with the FPCK method.

This paper is organized as follows. Section 2 defines the notation used in this paper and discusses the drawbacks of FPCK analysis. In Section 3, a new clustering and dimension reduction method for functional objects is described, and an algorithm to implement the method is proposed. In Section 4, the performance of the proposed method is studied using artificial data, and an illustrative application to real data is presented in Section 5. Finally, in Section 6, we conclude the paper with a discussion and make recommendations for future research.

Section snippets

Notation

First we present the notation that will be used throughout this paper. For ease of explanation, the same notation as in Yamamoto (2012) is used. Suppose that the nth functional object (n = 1, ..., N) with P variables is represented as x_n(t) = (x_np(t) | p = 1, ..., P), with domain T ⊂ R^d. For simplicity, we write x_n = (x_n(t) | t ∈ T) to denote the nth observed function. In the rest of the paper, for a general understanding of the problem, we consider the single-variable case, i.e., P = 1; in this case, the suffix p in the notation is omitted.
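To connect this notation with the "coefficient matrix of the basis function expansion" mentioned in the abstract, the standard basis-expansion representation is recorded below in generic symbols (the symbols c_nk and φ_k are chosen here and are not necessarily the paper's notation):

```latex
x_n(t) \;\approx\; \sum_{k=1}^{K} c_{nk}\,\phi_k(t), \qquad t \in \mathcal{T},
\qquad \mathbf{c}_n = (c_{n1}, \dots, c_{nK})^{\top},
```

so that the N × K matrix C = (c_nk) collects the expansion coefficients of all observed functions.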

Criterion of the functional factorial k-means method

To overcome the drawback of FPCK analysis discussed above, we propose a new clustering method with dimension reduction. The notation and settings were explained in Section 2. For ease of explanation, we first consider the case in which there is only one variable, i.e., P = 1. Thus, in this section, the suffix p is omitted from the notation. An extension to the multivariate model is straightforward and is described in Appendix A.

A least-squares objective function for the proposed approach, in
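The snippet breaks off before the objective function itself. According to the abstract, the criterion is a k-means criterion for functional objects projected onto a low-dimensional subspace, minimized by an alternating least-squares (ALS) algorithm. The sketch below illustrates an ALS scheme of that general flavor on a finite-dimensional representation (for instance, the basis-coefficient matrix from the earlier snippets): it alternates between updating an orthonormal projection for a fixed partition and reassigning clusters in the projected space. The update rules shown are those of the classical matrix-data version of this criterion, and the function name and defaults are choices made here; this is not the authors' exact FFKM algorithm.

```python
import numpy as np


def projected_kmeans_als(X, n_clusters, n_dims, n_iter=50, seed=0):
    """ALS sketch of a 'k-means on projected objects' criterion: minimise
    ||X A - U F||^2 over an orthonormal projection A (K x n_dims), cluster
    memberships U and reduced-space centroids F.

    X : (N, K) matrix, e.g. basis coefficients of N functional objects.
    Illustrative only; not the FFKM updates of the paper.
    """
    rng = np.random.default_rng(seed)
    N, K = X.shape
    labels = rng.integers(n_clusters, size=N)            # random initial partition

    for _ in range(n_iter):
        # Projection step: with the partition fixed, the optimal A spans the
        # eigenvectors of X'(I - P_U)X associated with the smallest eigenvalues,
        # i.e. the directions with least within-cluster scatter.
        U = np.eye(n_clusters)[labels]                    # (N, G) indicator matrix
        P_U = U @ np.linalg.pinv(U)                       # projector onto cluster means
        S = X.T @ (np.eye(N) - P_U) @ X
        _, eigvec = np.linalg.eigh(S)                     # eigenvalues in ascending order
        A = eigvec[:, :n_dims]

        # Assignment step: ordinary k-means reassignment in the projected space.
        Z = X @ A
        centroids = np.vstack([
            Z[labels == g].mean(axis=0) if np.any(labels == g) else Z[rng.integers(N)]
            for g in range(n_clusters)
        ])
        new_labels = np.argmin(((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

    return labels, A
```

Because the alternating scheme only guarantees a local minimum, in practice one would restart it from several random partitions and keep the solution with the smallest value of the loss ||XA − UF||².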

Data and evaluation procedures

To investigate the performance of the FFKM method, artificial data that included a known low-dimensional cluster structure were analyzed by four different methods: (i) the FFKM method, (ii) the two-step FFKM method (FFKMts), (iii) the FPCK method, and (iv) tandem analysis (TA), consisting of FPCA based on a basis function expansion (Ramsay and Silverman, 2005) followed by a standard k-means cluster analysis of the scores on the first L principal components. Note that the

Empirical example

In this section, we present an empirical analysis to demonstrate the use of the FFKM method and to compare its performance with that of the existing methods, FPCK and tandem analysis (TA). We used the well-known phoneme dataset from a speech-recognition problem, as described by Hastie et al. (1995). The data are log-periodograms computed from speech frames of 32 ms duration, corresponding to five phonemes: "sh" as in "she", "dcl" as in "dark", "iy" as the vowel in "she", "aa" as the vowel in "dark", and
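Since the snippet is cut off, the sketch below only illustrates how data of this shape (N log-periodogram curves observed on a common frequency grid, each labelled with one of the five phonemes) could be fed into the earlier snippets. The file name, column layout, and parameter values are hypothetical and are not the actual distribution or settings used by the authors.

```python
import numpy as np

# Hypothetical layout: a CSV with one row per speech frame, all but the last
# column holding log-periodogram values on a common frequency grid, and the
# last column holding the phoneme label.
raw = np.genfromtxt("phoneme.csv", delimiter=",", dtype=str, skip_header=1)
curves = raw[:, :-1].astype(float)                    # (N, n_freq) log-periodogram curves
phonemes = raw[:, -1]                                 # true phoneme labels, for evaluation only
freq_grid = np.arange(curves.shape[1], dtype=float)   # common frequency grid (arbitrary units)

# Reuse the earlier sketches: basis-expansion filtering, then clustering in a
# low-dimensional subspace, with the number of clusters set to the number of phonemes.
filter_labels, coefs = filter_then_cluster(curves, freq_grid, n_clusters=5)
subspace_labels, A = projected_kmeans_als(coefs, n_clusters=5, n_dims=2)
```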

Discussion

In this article, we explained the drawbacks of the FPCK method and proposed a new method, FFKM analysis, to overcome them. The FFKM method aims to simultaneously classify functional objects into optimal clusters and find a subspace that best describes both the classification and the dimension reduction of the data. An ALS algorithm was proposed to efficiently solve the minimization problem of the least-squares objective function. Analysis of artificial data reveals that the FFKM method can give

Acknowledgments

We thank the Associate Editor and two anonymous reviewers for their constructive comments that helped to improve the quality of this article. This work was supported by JSPS Grant-in-Aid for JSPS Fellows Number 24-2676.

References

  • C. Bouveyron, J. Jacques. Model-based clustering of time series in group-specific functional subspaces. Adv. Data Anal. Classif. (2011)
  • N. Dunford, J.T. Schwartz. Linear Operators, Part 2: Spectral Theory, Self Adjoint Operators in Hilbert Space (1988)
  • F. Ferraty, P. Vieu. Nonparametric Functional Data Analysis (2006)
  • S.A. Gattone, R. Rocci. Clustering curves on a reduced subspace. J. Comput. Graph. Statist. (2012)
  • P.J. Green, B.W. Silverman. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach (1994)
  • T. Hastie, A. Buja, R. Tibshirani. Penalized discriminant analysis. Ann. Statist. (1995)