Effect of using principal coordinates and principal components on retrieval of clusters

https://doi.org/10.1016/j.csda.2005.01.013

Abstract

Principal coordinate analysis is a more powerful technique than principal component analysis for identifying groups of objects when certain conditions are satisfied. The results of using principal coordinates prior to cluster analysis were investigated. Three different methods of standardization were examined and compared with no standardization, using both principal coordinates and principal components. The retrieval abilities of well-known agglomerative clustering algorithms were improved by using principal coordinates. Applying principal coordinates based on the correlation coefficient, rather than on Euclidean distance, prior to the clustering algorithms also made the results less sensitive to changes in noise.

Introduction

A distinction that is often made between cluster analysis and principal component analysis (PCA) is that cluster analysis is concerned with the classification of individuals, whereas principal component techniques assess relationships between variables and may be concerned with the classification of those variables. If a large number of variables is involved, it is common practice to use the principal components with the larger eigenvalues to reduce the number of variables before applying cluster analysis.

Baxter (1995) empirically investigates different approaches concerning the transformation and standardization of data before applying PCA. Arnold and Collins (1993) attempt to interpret the final axes to identify the quality or attribute that is being measured by the orthogonally transformed variables. The interpretation of the distance between the ith and jth objects of N samples is discussed for some commonly used types of analysis, including PCA of an N×p data matrix and principal coordinate analysis (POA) of an N×N symmetric matrix of the Euclidean inter-point distances (Gower, 1966). With certain criteria, PCA on the p×p symmetric matrix and POA on the N×N symmetric matrix are defined as being dual to one another when they both lead to a set of N data points with the same inter-point distances.
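
As an illustration of this duality (a sketch not taken from the paper; the function name and the small test matrix are invented), principal coordinates can be computed from an N×N matrix of Euclidean inter-point distances by double-centring the squared distances and taking an eigendecomposition. The recovered configuration has the same inter-point distances as the original data, which is the sense in which the two analyses are dual.

```python
import numpy as np

def principal_coordinates(D, n_dims=2):
    """Classical scaling (principal coordinate analysis) of an N x N
    distance matrix D, in the spirit of Gower (1966)."""
    N = D.shape[0]
    # Double-centre the matrix of squared distances: B = -1/2 * J D^2 J
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ (D ** 2) @ J
    # Principal coordinates are eigenvectors scaled by sqrt of the eigenvalues
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalues first
    eigvals, eigvecs = eigvals[order][:n_dims], eigvecs[:, order][:, :n_dims]
    keep = eigvals > 0                         # retain only positive axes
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])

# Tiny check of the duality: coordinates recovered from Euclidean distances
# reproduce the inter-point distances of the original data matrix.
X = np.array([[0.0, 1.0], [2.0, 0.5], [1.0, 3.0], [4.0, 2.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = principal_coordinates(D, n_dims=2)
print(np.allclose(D, np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)))  # True
```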

Matrices are formed with elements d_ij taken either from the Euclidean distances or from the distances obtained from the correlation coefficient. The use of the correlation coefficient has been suggested extensively (Eisen et al., 1998, Wu, 2001, Hadjiargyrou et al., 2002) for establishing clusters when grouping genes. The distance is calculated from the similarity using the formula d_ij = √(2(1 − γ_ij)), where γ_ij is the correlation coefficient between the ith and jth objects in the data set. The results of using principal components and principal coordinates prior to the agglomerative clustering algorithms defined as the (β,π)-family (DuBien and Warde, 1987) are compared using three different standardizations.
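
A minimal sketch of this correlation-based distance, assuming the formula d_ij = √(2(1 − γ_ij)) as reconstructed above and using plain NumPy (the function name is illustrative, not from the paper):

```python
import numpy as np

def correlation_distance_matrix(X):
    """Distance matrix d_ij = sqrt(2 * (1 - gamma_ij)), where gamma_ij is the
    correlation between the i-th and j-th rows (objects) of the N x p matrix X."""
    gamma = np.corrcoef(X)               # N x N matrix of object-wise correlations
    gamma = np.clip(gamma, -1.0, 1.0)    # guard against values just above 1 from rounding
    return np.sqrt(2.0 * (1.0 - gamma))

# Example: 5 objects measured on 9 variables
rng = np.random.default_rng(0)
D = correlation_distance_matrix(rng.normal(size=(5, 9)))
print(np.allclose(np.diag(D), 0.0))      # True: each object is at distance 0 from itself
```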

We show that the retrieval ability of a clustering algorithm can be improved, and made less sensitive to changes in noise, by using principal coordinates prior to cluster analysis. Rand's (1971) C statistic, 0.0 ≤ C ≤ 1.0, is used to compare the retrieval abilities of clustering algorithms. If C equals 1.0, the clusters generated by a clustering algorithm reproduce perfectly the clusters defined on the data set.
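
For concreteness, a brute-force sketch of Rand's C statistic as the proportion of object pairs on which two partitions agree (either placing the pair together in both partitions or apart in both); this is an illustration, not the authors' implementation:

```python
from itertools import combinations

def rand_statistic(labels_a, labels_b):
    """Rand's (1971) C statistic, 0.0 <= C <= 1.0: the fraction of object pairs
    treated consistently (together in both partitions or apart in both)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Perfect reproducibility gives C = 1.0, regardless of how clusters are labelled
print(rand_statistic([0, 0, 1, 1, 2], [5, 5, 7, 7, 9]))   # 1.0
# Disagreement on some pairs lowers C
print(rand_statistic([0, 0, 1, 1, 2], [0, 1, 1, 1, 2]))   # 0.7
```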

Section snippets

Agglomerative clustering algorithms

Suppose p variables are observed on each data point in a sample of size N. The primitive concepts of cluster analysis are the data points to be clustered, the object space (the set of all data points to be clustered), and a cluster, which is an operationally determined collection of data points. The N×p matrix of measurements is X = X_N = (X_1, X_2, …, X_N)′, where X_i represents a p×1 vector of measurements on the ith data point. Thus, X_N indicates that there are N data points in the object space in X. For
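
The snippet cuts off before the algorithms themselves are defined, but agglomerative methods of this kind can be sketched through the Lance-Williams distance-update recurrence; the (β,π)-family cited above is a parameterized family in a similar spirit, and its exact coefficients should be taken from DuBien and Warde (1987). The toy implementation below is illustrative only.

```python
import numpy as np

def agglomerate(D, n_clusters, alpha=0.5, beta=0.0, gamma=0.0):
    """Naive agglomerative clustering driven by the Lance-Williams recurrence
    d(k, i+j) = alpha*d(k,i) + alpha*d(k,j) + beta*d(i,j) + gamma*|d(k,i) - d(k,j)|.
    With alpha=0.5, beta=0.0, gamma=-0.5 this is single linkage; gamma=+0.5
    gives complete linkage.  D is an N x N distance matrix."""
    D = D.astype(float).copy()
    clusters = [[i] for i in range(D.shape[0])]
    while len(clusters) > n_clusters:
        # Find the closest pair (i, j) of current clusters, with i < j
        iu = np.triu_indices(len(clusters), k=1)
        flat = np.argmin(D[iu])
        i, j = iu[0][flat], iu[1][flat]
        # Update distances from every other cluster to the merged cluster i+j
        new_row = (alpha * D[:, i] + alpha * D[:, j]
                   + beta * D[i, j] + gamma * np.abs(D[:, i] - D[:, j]))
        D[:, i] = D[i, :] = new_row
        D[i, i] = 0.0
        D = np.delete(np.delete(D, j, axis=0), j, axis=1)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Three obvious groups on the line, recovered by complete linkage
x = np.array([0.0, 0.1, 5.0, 5.1, 10.0, 10.2])
D = np.abs(x[:, None] - x[None, :])
print(agglomerate(D, n_clusters=3, gamma=0.5))   # [[0, 1], [2, 3], [4, 5]]
```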

Principal component analysis

Principal component analysis performed on an N×p data matrix yields linearly transformed random variables that have special properties in terms of variances. In effect, transforming the original vector variable into the vector of principal components amounts to a rotation of the coordinate axes to a new coordinate system that has inherent statistical properties. The principal components turn out to be the characteristic vectors of the covariance matrix. Thus, the study of principal components can be
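
A short sketch of this view of PCA in plain NumPy (not code from the paper): the component directions are the characteristic vectors of the sample covariance matrix, and projecting the centred data onto them is a rotation that preserves the total variance.

```python
import numpy as np

def principal_components(X, n_components=None):
    """PCA of an N x p data matrix via the eigenvectors (characteristic vectors)
    of the sample covariance matrix.  Returns component scores and eigenvalues."""
    Xc = X - X.mean(axis=0)                  # centre each variable
    S = np.cov(Xc, rowvar=False)             # p x p sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]        # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if n_components is not None:
        eigvals, eigvecs = eigvals[:n_components], eigvecs[:, :n_components]
    return Xc @ eigvecs, eigvals

# The rotation preserves total variance: the eigenvalues sum to the
# sum of the original variables' variances.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 9))
scores, variances = principal_components(X)
print(np.isclose(variances.sum(), np.var(X, axis=0, ddof=1).sum()))   # True
```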

Design of simulation study

For convenience, the number of data points in X is N = 60, the number of variables is p = 9, and the number of clusters is k = 3 in this study. A brief summary of the data structure may be outlined as follows: X_gi ∼ MVN(μ_g, Σ), where g = 1, …, k and i = 1, 2, …, N. The data points are split into k = 3 populations of sizes (n_1; n_2; n_3) ∈ {(20; 20; 20), (25; 20; 15)}, and the mean vectors μ_g, g = 1, 2, 3, are constrained to an equilateral triangle spatial configuration: μ_1 = (0.0, δ_c, δ_c, δ_c, 0.0, δ_c, δ_c, δ_c, 0.0)′, μ_2 = (δ_c, 0.0, δ_c, δ_c, δ_c, 0.0, 0.0, δ_c, δ_c)′, μ_3 = (δ_c, δ_c,
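
A hedged sketch of this simulation design follows. The value of δ_c, the covariance matrix Σ, and the third mean vector are not visible in the snippet, so delta_c, Sigma, and the third row of mu below are illustrative placeholders, with the third mean chosen so that the three population means stay equidistant.

```python
import numpy as np

rng = np.random.default_rng(2025)

delta_c = 2.0                  # cluster separation; the paper's values are not shown here
p, sizes = 9, (20, 20, 20)     # p = 9 variables and the (20; 20; 20) split of N = 60
Sigma = np.eye(p)              # common covariance matrix; illustrative choice only

# Mean vectors patterned after mu_1 and mu_2 above; the third row is improvised
# so that all pairwise distances between the means are equal (equilateral layout).
mu = np.array([
    [0, 1, 1, 1, 0, 1, 1, 1, 0],
    [1, 0, 1, 1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1, 1, 0, 1],
]) * delta_c

# Draw n_g observations from MVN(mu_g, Sigma) for each of the k = 3 populations
X = np.vstack([
    rng.multivariate_normal(mu[g], Sigma, size=n) for g, n in enumerate(sizes)
])
labels = np.repeat([0, 1, 2], sizes)
print(X.shape, labels.shape)   # (60, 9) (60,)
```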

Example

The use of principal components and principal coordinates prior to applying the clustering algorithm is illustrated on the cell cycle data of Spellman et al. (1998). The primary data were obtained at http://cellcycle-www.stanford.edu. A total of 800 yeast (Saccharomyces cerevisiae) genes were identified as being periodically regulated and as meeting an objective minimum criterion for cell cycle regulation according to their normalization procedure applied to the primary data.

Among the 800 identified genes, 630

Concluding remarks

The use of principal coordinates instead of principal components prior to cluster analysis has been investigated and compared. Principal coordinate analysis (POA) differs from principal component analysis (PCA) in that principal coordinates do not include information on the variables (Gower and Harding, 1988), while principal components do. However, each object is uniquely identified by its principal coordinates.

In applying the procedure, three different methods of standardization were

