Effect of using principal coordinates and principal components on retrieval of clusters
Introduction
A distinction often made between cluster analysis and principal component analysis (PCA) is that cluster analysis is concerned with the classification of individuals, while principal component techniques assess relationships between variables and may be used to classify those variables. When a large number of variables is involved, one practice is to keep only the principal components with the largest eigenvalues, reducing the number of variables before cluster analysis is applied.
Baxter (1995) empirically investigates different approaches to transforming and standardizing data before applying PCA. Arnold and Collins (1993) attempt to interpret the final axes in order to identify the quality or attribute measured by the orthogonally transformed variables. The interpretation of the distance between the ith and jth objects of N samples is discussed for some commonly used types of analysis, including PCA of an $N \times p$ data matrix and principal coordinate analysis (POA) of an $N \times N$ symmetric matrix of Euclidean inter-point distances (Gower, 1966). Under certain conditions, PCA on the $p \times p$ symmetric matrix and POA on the $N \times N$ symmetric matrix are said to be dual to one another when both lead to a set of N data points with the same inter-point distances.
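The duality mentioned above can be sketched with a singular value decomposition. The notation below ($X$, $U$, $V$, $\Lambda$, $J$, $D^{(2)}$) is assumed for illustration and is not taken from the source; $X$ is taken to be an $N \times p$ data matrix whose columns have been centered, and $d_{ij}$ is the Euclidean distance between rows $i$ and $j$ of $X$.

```latex
\begin{align*}
X &= U \Lambda^{1/2} V'
  && \text{(thin SVD of the centered data matrix)} \\
X'X &= V \Lambda V'
  && \text{(PCA: eigen-decomposition of the } p \times p \text{ matrix)} \\
Y &= XV = U \Lambda^{1/2}
  && \text{(principal component scores)} \\
XX' &= U \Lambda U' = -\tfrac{1}{2}\, J D^{(2)} J,
  \quad J = I_N - \tfrac{1}{N}\mathbf{1}\mathbf{1}',
  \quad D^{(2)} = \bigl(d_{ij}^{2}\bigr)
  && \text{(POA: doubly centered squared distances)} \\
Z &= U \Lambda^{1/2}
  && \text{(principal coordinates)} \\
\lVert y_i - y_j \rVert^{2} &= \lVert z_i - z_j \rVert^{2} = d_{ij}^{2}
  && \text{(same inter-point distances)}
\end{align*}
```

Under these assumptions the two sets of points coincide, up to the sign of each axis (and up to a rotation within any repeated eigenvalue). The two analyses can differ once the distances are not Euclidean distances computed from $X$, for example distances derived from correlation coefficients.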
Two kinds of distance matrix are used: one whose elements are Euclidean distances and one whose elements are distances obtained from the correlation coefficient. The correlation coefficient is widely recommended (Eisen et al., 1998; Wu, 2001; Hadjiargyrou et al., 2002) as a basis for forming clusters when grouping genes. The distance between the ith and jth objects in the set of data is calculated from their similarity, which is the correlation coefficient between them. The results of using principal components and principal coordinates prior to the agglomerative clustering algorithms of the family defined by DuBien and Warde (1987) are compared under three different standardizations.
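As a minimal sketch of a correlation-based distance matrix, the Python below converts Pearson correlations into distances with d = 1 - r. That conversion is a common choice and is an assumption here; it is not necessarily the formula used in this study. The function names and the toy profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_distance_matrix(rows):
    """Symmetric N x N matrix with entries 1 - r_ij (assumed conversion)."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = 1.0 - pearson(rows[i], rows[j])
    return dist


# Toy expression profiles: one row per object (e.g. gene), one column per variable.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with the first row
    [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with the first row
]
D = correlation_distance_matrix(profiles)
```

With this conversion, perfectly correlated profiles are at distance 0 and perfectly anti-correlated profiles are at distance 2, so objects are grouped by the shape of their profiles and not by their magnitude.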
We show that the retrieval ability of a clustering algorithm can be improved, and made less sensitive to changes in the noise level, by using principal coordinates prior to cluster analysis. Rand's (1971) C statistic is used to compare the retrieval abilities of clustering algorithms. If C is equal to 1.0, the clusters generated by a clustering algorithm reproduce perfectly the clusters defined on the set of data.
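Rand's statistic can be sketched as the proportion of pairs of data points on which two partitions agree, that is, pairs placed together in both partitions or apart in both. The function name and the example labelings below are hypothetical and serve only as an illustration.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Proportion of point pairs on which two partitions agree (Rand, 1971)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same data points")
    if len(labels_a) < 2:
        raise ValueError("at least two data points are needed")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        if same_in_a == same_in_b:
            agree += 1
        total += 1
    return agree / total


true_labels = [0, 0, 1, 1, 2, 2]
relabeled = [1, 1, 0, 0, 2, 2]   # same partition, different cluster names
imperfect = [0, 0, 1, 2, 2, 2]   # one point assigned to the wrong cluster

c_perfect = rand_index(true_labels, relabeled)
c_imperfect = rand_index(true_labels, imperfect)
```

The statistic depends only on which points share a cluster, so renaming the clusters leaves it at 1.0, while the single misassigned point in the third labeling lowers it to 12 agreeing pairs out of 15.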
Agglomerative clustering algorithms
Suppose p variables are observed on each data point in a sample of size N. The primitive concepts of cluster analysis are the data points to be clustered; the object space, which is the set of all data points to be clustered; and the cluster, which is an operationally determined collection of data points. The matrix of measurements collects, for each i, a vector of p measurements on the ith data point, so that there are N data points in the object space. For
Principal component analysis
Principal component analysis performed on an $N \times p$ data matrix yields linearly transformed random variables that have special properties in terms of variances. In effect, transforming the original vector variable to the vector of principal components amounts to a rotation of the coordinate axes to a new coordinate system that has inherent statistical properties. The coefficient vectors of the principal components turn out to be the characteristic vectors of the covariance matrix. Thus, the study of principal components can be
Design of simulation study
For convenience, the number of data points, the number of variables, and the number of clusters are held fixed in this study. The data points are split into populations, and the population mean vectors are constrained to an equilateral triangle spatial configuration,
Example
The use of principal components and principal coordinates prior to applying the clustering algorithm is illustrated on the cell cycle data of Spellman et al. (1998). The primary data were obtained from http://cellcycle-www.stanford.edu. A total of 800 yeast (Saccharomyces cerevisiae) genes were identified as being periodically regulated and as meeting an objective minimum criterion for cell cycle regulation, according to the normalization procedure those authors applied to the primary data.
Among the identified 800 genes, 630
Concluding remarks
The use of principal coordinates instead of principal components prior to cluster analysis has been investigated, and the two approaches have been compared. Principal coordinate analysis (POA) differs from principal component analysis (PCA) in that principal coordinates do not carry information on the variables (Gower and Harding, 1988), while principal components do. However, each object is uniquely identified by its principal coordinates.
In applying the procedure, three different methods of standardization were
References (14)
- et al. Transcriptional profiling of bone regeneration: insight into the molecular complexity of wound repair. J. Biol. Chem. (2002)
- et al. Effective dimensionality of large-scale expression data using principal component analysis. BioSystems (2002)
- et al. Interpretation of transformed axes in multivariate analysis. Appl. Statist. (1993)
- Standardization and transformation in principal component analysis, with applications to archaeometry. Appl. Statist. (1995)
- et al. A comparison of agglomerative clustering methods with respect to noise. Comm. Statist. Theory Methods (1987)
- et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA (1998)
- Gnanadesikan, R., 1997. Methods for Statistical Data Analysis of Multivariate Observations. Wiley, New York, pp....