Independent Component Analysis for the objective classification of globular clusters of the galaxy NGC 5128

https://doi.org/10.1016/j.csda.2012.06.008

Abstract

Independent Component Analysis (ICA) is closely related to Principal Component Analysis (PCA) and factor analysis. Whereas ICA finds a set of source data that are mutually independent, PCA finds a set of data that are mutually uncorrelated. Data arising from different physical processes may be assumed to be uncorrelated, but the converse does not always hold: uncorrelated data need not come from different physical processes. This is because lack of correlation is a weaker property than independence.

In the present case an objective classification of the globular clusters (GCs) of NGC 5128 has been carried out. Components responsible for significant variation have been obtained through both Principal Component Analysis (PCA) and Independent Component Analysis (ICA) and the classification has been done by K-means clustering. The set of observable parameters includes structural parameters, spectroscopically determined Lick indices and radial velocities from the literature.

We propose that GCs of NGC 5128 consist of two groups. One group originated in the original cluster formation event that coincided with the formation of the elliptical galaxy and the other group emerged from an accreted spiral galaxy. This is unlike the previous result (Chattopadhyay et al., 2009) which accounts for a third group originating from the accretion of tidally stripped dwarf galaxies.

Introduction

In many real-life situations both the number of variables under consideration and the number of observations are very large. To analyze such multivariate data, the dimension must be reduced appropriately before further analysis such as classification or clustering. In statistics, Principal Component Analysis (PCA) is the most popular dimension reduction technique. Although PCA is basically an exploratory technique, making inferences requires a normality assumption about the underlying multivariate distribution. The eigenvalues and eigenvectors of the covariance or correlation matrix are the main ingredients of a PCA: the eigenvectors determine the directions of maximum variability, whereas the eigenvalues specify the variances along those directions. In practice, decisions regarding the quality of the principal component approximation should be made on the basis of the eigenvalue–eigenvector pairs. Studying the sampling distribution of their estimates requires the multivariate normality assumption, as the problem is otherwise too difficult. Principal components (PCs) are a sequence of projections of the data, constructed so that they are uncorrelated and ordered by variance. The PCs of a p-dimensional data set provide a sequence of best linear approximations. As only a few (say, m ≪ p) of these linear combinations may explain a large percentage of the variation in the data, one can use those m components instead of the p variables for further analysis.
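The PCA steps described above can be illustrated with a minimal sketch. This is not the authors' code (their analysis was done in R and used a robust PCA variant in earlier work); it is plain covariance-based PCA in NumPy. The function name `pca`, the 90% variance threshold `var_target` and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np

def pca(X, var_target=0.9):
    """Return scores on the first m PCs, where m is the smallest number of
    components whose cumulative share of variance reaches var_target."""
    Xc = X - X.mean(axis=0)                    # center each variable
    cov = np.cov(Xc, rowvar=False)             # p x p sample covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    order = np.argsort(eigval)[::-1]           # reorder to descending variance
    eigval, eigvec = eigval[order], eigvec[:, order]
    share = np.cumsum(eigval) / eigval.sum()   # cumulative proportion of variance
    m = int(np.searchsorted(share, var_target) + 1)
    scores = Xc @ eigvec[:, :m]                # project onto the first m directions
    return scores, eigval, m

# Hypothetical data: 130 observations of 5 variables driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.standard_normal((130, 2))
loadings = rng.standard_normal((2, 5))
X = latent @ loadings + 0.05 * rng.standard_normal((130, 5))

scores, eigval, m = pca(X)
```

The columns of `scores` are uncorrelated and their variances equal the leading eigenvalues, which is the sense in which the PCs are "ordered by variance".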

More recently, Independent Component Analysis (ICA) has emerged as a strong competitor to PCA and factor analysis. ICA was developed primarily for non-Gaussian data, to find independent components (rather than merely uncorrelated ones, as in PCA) responsible for a large part of the variation. ICA separates statistically independent components, which represent the original source data, from an observed set of data mixtures. Not all of the information in a multivariate data set is equally important, so the most useful information needs to be extracted; ICA extracts and reveals this information from the whole data set. The technique has been applied in fields such as speech processing, brain imaging and stock prediction.
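To make the idea of separating independent sources from observed mixtures concrete, the following is a minimal sketch of one ICA algorithm, symmetric FastICA with a tanh nonlinearity. The passage above does not say which ICA algorithm the authors used, so this choice is an assumption, as are the function names, the mixing matrix `A_mix` and the uniform synthetic sources.

```python
import numpy as np

def whiten(X):
    """Center X and rotate/scale it so its sample covariance is the identity."""
    Xc = X - X.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ (eigvec / np.sqrt(eigval))     # column j scaled by 1/sqrt(eigval[j])

def sym_decorrelate(W):
    """Return (W W^T)^(-1/2) W, which makes the rows of W orthonormal."""
    s, u = np.linalg.eigh(W @ W.T)
    return (u / np.sqrt(s)) @ u.T @ W

def fast_ica(X, max_iter=200, tol=1e-8, seed=0):
    """Estimate independent components of X (rows are observations)."""
    Z = whiten(X)
    n, p = Z.shape
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.standard_normal((p, p)))
    for _ in range(max_iter):
        Y = Z @ W.T                            # current source estimates
        G = np.tanh(Y)                         # nonlinearity g
        # Fixed-point update: w <- E[z g(w'z)] - E[g'(w'z)] w, for every row w
        W_new = (G.T @ Z) / n - np.diag((1.0 - G ** 2).mean(axis=0)) @ W
        W_new = sym_decorrelate(W_new)
        # Rows may flip sign between iterations, so compare |<w_new, w>| with 1
        converged = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0)) < tol
        W = W_new
        if converged:
            break
    return Z @ W.T

# Hypothetical example: two independent uniform sources, linearly mixed.
rng = np.random.default_rng(42)
S_true = rng.uniform(-1.0, 1.0, size=(2000, 2))
A_mix = np.array([[1.0, 0.5], [0.3, 1.0]])
X_obs = S_true @ A_mix.T
Y_est = fast_ica(X_obs)

# Absolute correlations between estimated components (rows) and true sources (columns)
match = np.abs(np.corrcoef(Y_est.T, S_true.T)[:2, 2:])
```

ICA recovers sources only up to order, sign and scale, which is why the comparison in `match` uses absolute correlations rather than direct differences.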

Although ICA has already been used for the analysis of astronomical data (e.g., Maino et al. (2002), Funaro et al. (2003), Capozziello and Funaro (2005)), it has not so far been used for clustering data. In the present study we have used both PCA and ICA to reduce the dimension of a data set related to the globular clusters of the galaxy NGC 5128 in order to make an objective classification. A Globular Cluster (GC) is generally a spherical collection of metal-poor stars that orbits a galactic core as a satellite. Studies of GCs are important for understanding stellar evolution, the formation of galaxies and the underlying cosmology. Although their origin and formation scenario are thought to be connected to those of their host galaxy, a detailed investigation is still in progress. The stars in GCs are among the oldest stars in the galaxy. The brightness and distinctive appearance of GCs make them relatively easy to detect at large distances. Classical models of galaxy formation can be divided into five major categories: (1) the monolithic collapse model, (2) the major merger model, (3) the multiphase dissipational collapse model, (4) the dissipationless merger model and (5) accretion and in situ hierarchical merging.

According to the monolithic collapse model, an elliptical galaxy is formed through the collapse of an isolated massive gas cloud at high redshift (Larson, 1975, Carlberg, 1984, Arimoto and Yoshii, 1987). In this model, the color distribution of GCs is unimodal, and the rotation of GCs is produced by the tidal force from satellite galaxies (Peebles, 1969). In the major merger model, elliptical galaxies are formed by the merger of two or more disk galaxies (Toomre, 1977, Ashman and Zepf, 1992, Zepf et al., 2000). Younger GCs are formed out of the shocked gas in the disk, while blue GCs come from the halos of the merging galaxies (Bekki et al., 2002). As a result, the color distribution is bimodal. In this scenario, the kinematic properties of the GCs depend weakly on the orbital configuration of the merging galaxies, but the metal-rich GCs are generally located in the inner region of the galaxy, and the metal-poor ones in the outer regions.

The multiphase dissipational collapse has been proposed by Forbes et al. (1997). According to this model, the GCs form in distinct star formation episodes through dissipational collapse. In addition, there is tidal stripping of GCs from satellite dwarf galaxies. Blue (metal-poor) GCs form in the initial phase and red (metal-rich) GCs form from the enriched medium at a later epoch, thus producing a bimodal color distribution of the GCs. This model predicts that the system of blue GCs has no rotation and a high-velocity dispersion, while the red GCs show some rotation depending on the degree of dissipation. Côté et al. (1998) proposed a model in which the GC color bimodality is due to the capture of metal-poor GCs through merger or tidal stripping. The metal-rich GCs are the initial population of GCs in the galaxy and are more centrally concentrated than the captured GCs. The main difference with the previous model is that no age difference is expected between the blue and red GC populations. The very different origins for the two populations imply rather different orbital properties; in particular, the metal-poor GCs should show a larger velocity dispersion than the metal-rich ones, comparable in the outer region to that of the neighboring galaxies.

From the above discussion, it appears that there are kinematic differences among the subpopulations of GCs in different galaxies. These differences can be used as an observational constraint on the galaxy formation model. In the above studies, the GCs are classified as metal-rich or metal-poor on the basis of the value of a single variable ([Fe/H] > −1 or < −1), which is subjective in nature and also inappropriate in a multivariate setup. Concentrating on a single variable means that one ignores the joint effect of several parameters.

NGC 5128 (Centaurus A; Fig. 1) is a prominent galaxy in the constellation Centaurus and one of the closest giant elliptical galaxies to Earth. Its active galactic nucleus has been extensively studied by professional astronomers (Beasley et al., 2008). In a previous work (Chattopadhyay et al., 2009), we first used a modified PCA technique (Salibián-Barrera et al., 2006) to search for the optimum set of parameters giving maximum variation for the GCs (Fig. 2) in NGC 5128. This can be considered a robust PCA based on a multivariate MM estimator. In that work we used two methods for cluster analysis (CA): one based on mixture models (Qiu and Tamhane, 2007) and the other K-means (MacQueen, 1967). To find the optimum number of clusters we used the method developed by Sugar and James (2003). The robust PCA method and the cluster analysis based on mixture models are discussed briefly in Appendix A and Appendix B, respectively.

Although the modified PCA and mixture-model-based clustering methods mentioned above are quite robust, they perform better when the sample size is large in comparison with the dimension of the data set. Here, however, the number of GCs having values for all 15 variables (parameters) is only 130. Furthermore, tests for normality show that the samples come from a non-Gaussian distribution.

In the present study we have carried out K-means clustering on the basis of ICs as well as PCs, and used the within-cluster sum of squares to identify which method is better suited to the present data set.
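As an illustration of the criterion mentioned above, the following is a minimal sketch of K-means (Lloyd's algorithm) that also returns the within-cluster sum of squares (WCSS). It is not the authors' R code; the function names, the random initialization and the two synthetic blobs are assumptions. In the setting of the paper, the same function would be applied once to the IC scores and once to the PC scores, and the resulting WCSS values compared.

```python
import numpy as np

def sq_dist(X, centers):
    """Squared Euclidean distance from every row of X to every center."""
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm; returns labels, centers and the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = sq_dist(X, centers).argmin(axis=1)      # assignment step
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])                                               # update step
        done = np.allclose(new_centers, centers)
        centers = new_centers
        if done:
            break
    d2 = sq_dist(X, centers)
    labels = d2.argmin(axis=1)                           # labels for the final centers
    wcss = float(d2[np.arange(len(X)), labels].sum())
    return labels, centers, wcss

# Hypothetical data: two well-separated groups of 50 points each.
rng = np.random.default_rng(3)
data = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(10.0, 1.0, (50, 2))])

labels_1, _, wcss_1 = kmeans(data, 1)
labels_2, _, wcss_2 = kmeans(data, 2)
```

A smaller WCSS indicates more compact clusters, but the values are only comparable between representations on a similar scale, for example whitened PCs and unit-variance ICs.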

Section 2 of this paper describes different features of ICA, while Section 3 gives a theoretical comparison between PCA and ICA. The data analysis, the properties of the three groups and the conclusions are presented in Section 4.

The explanations of all the structural and photometric parameters used in this paper are listed in Table 1.

Section snippets

Independent component analysis

Suppose there are n observations on each of p correlated variables. Let us denote the data matrix by $X_{n \times p}$. By singular value decomposition one can write $X = UDV^{T}$. Writing $S = \sqrt{n}\,U$ and $A = DV^{T}/\sqrt{n}$, we have $X = SA$ and hence each of the columns of X is a linear combination of the columns of S. Now since U is orthogonal and assuming that the columns of X have mean zero, it is easy to show that the columns of S have zero mean, unit variance and they are uncorrelated. In terms of random variables we can
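The properties claimed for S in the decomposition above can be checked numerically. The sketch below assumes the scaling S = √n U and A = DVᵀ/√n, which is the scaling under which the columns of S have unit variance (with divisor n) and S A reproduces X. The synthetic data matrix is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4

# Hypothetical correlated data with column means removed
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
X = X - X.mean(axis=0)

# Thin SVD: X = U diag(d) Vt, with U of shape (n, p) having orthonormal columns
U, d, Vt = np.linalg.svd(X, full_matrices=False)

S = np.sqrt(n) * U                  # candidate "sources"
A = np.diag(d) @ Vt / np.sqrt(n)    # mixing matrix, so that X = S A

reconstruction_ok = np.allclose(S @ A, X)
zero_mean_ok = np.allclose(S.mean(axis=0), 0.0, atol=1e-8)
# S'S / n = U'U = I: unit variances (divisor n) and zero correlations
identity_cov_ok = np.allclose(S.T @ S / n, np.eye(p))
```

The zero-mean property relies on X having full column rank, so that the centering of X carries over to the columns of U.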

Independent component analysis versus principal component analysis

Both independent component analysis and principal component analysis are used for analyzing large data sets. Whereas ICA finds a set of source data that are mutually independent, PCA finds a set of data that are mutually uncorrelated. ICA was originally developed for separating mixed audio signals into independent sources. In this paper we make the comparison by analyzing GC data.
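The distinction between uncorrelated and independent can be seen in a small worked example. The sketch below is an illustration, not taken from the paper: X takes the values −2, …, 2 with equal probability and Y = X², so Y is completely determined by X, yet their covariance is exactly zero.

```python
from fractions import Fraction

xs = [-2, -1, 0, 1, 2]            # X is uniform on a set symmetric about zero
ys = [x * x for x in xs]          # Y is a deterministic function of X
n = len(xs)

mean_x = Fraction(sum(xs), n)
mean_y = Fraction(sum(ys), n)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Independence would require P(X=2, Y=4) = P(X=2) * P(Y=4)
p_x2 = Fraction(sum(x == 2 for x in xs), n)
p_y4 = Fraction(sum(y == 4 for y in ys), n)
p_joint = Fraction(sum(x == 2 and y == 4 for x, y in zip(xs, ys)), n)

uncorrelated = (cov_xy == 0)
independent = (p_joint == p_x2 * p_y4)
```

Here `uncorrelated` is true while `independent` is false, since the joint probability 1/5 differs from the product 2/25. PCA would treat such a pair as already separated, whereas ICA would not.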

The purpose of PCA is to reduce the original data set of two or more sequentially observed variables by identifying

Data analysis

For data analysis we have used R software and all the necessary programs are written using R script.

Acknowledgments

The authors are grateful to the referees for their comments which significantly improved the quality of the paper.

References (43)

  • P. Comon. Independent component analysis, a new concept? Signal Processing (1994)
  • A. Hyvärinen et al. Independent component analysis: algorithms and applications. Neural Networks (2000)
  • D. Qiu et al. A comparative study of the K-means algorithm and the normal mixture model for clustering: univariate case. Journal of Statistical Planning and Inference (2007)
  • H. Albazzaz et al. Statistical process control charts for batch operations based on independent component analysis. Industrial & Engineering Chemistry Research (2004)
  • N. Arimoto et al. Chemical and photometric properties of a galactic wind model for elliptical galaxies. Astronomy & Astrophysics (1987)
  • K.M. Ashman et al. The formation of globular clusters in merging and interacting galaxies. The Astrophysical Journal (1992)
  • P. Barmby et al. M31 globular clusters: colors and metallicities. The Astronomical Journal (2000)
  • M.A. Beasley et al. A 2dF spectroscopic study of globular clusters in NGC 5128: probing the formation history of the nearest giant elliptical
  • K. Bekki et al. Globular cluster formation from gravitational tidal effects of merging and interacting galaxies
  • S. Capozziello et al. Separation of artifacts and events in astrophysical images using independent component analysis. International Journal of Computational Cognition (2005)
  • R.G. Carlberg. Dissipative formation of an elliptical galaxy. The Astrophysical Journal (1984)
  • S.A. Cellone et al. Washington photometry of low surface brightness dwarf galaxies in the Fornax cluster: constraints on their stellar populations. The Astrophysical Journal (1996)
  • A.K. Chattopadhyay et al. Study of NGC 5128 globular clusters under multivariate statistical paradigm. The Astrophysical Journal (2009)
  • P. Côté et al. The formation of giant elliptical galaxies and their globular cluster systems. The Astrophysical Journal (1998)
  • D.A. Forbes et al. On the origin of globular clusters in elliptical and cD galaxies. The Astronomical Journal (1997)
  • M. Funaro et al. Independent component analysis for artefact separation in astrophysical images. Neural Networks (2003)
  • D. Geisler et al. Washington photometry of the globular cluster system of NGC 4472, I, analysis of the metallicities. The Astronomical Journal (1996)
  • J.A. Graham. The structure and evolution of NGC 5128. The Astrophysical Journal (1979)
  • T. Hastie et al.
  • A. Hyvärinen et al. Independent Component Analysis (2001)
  • F.P. Israel. Centaurus A - NGC 5128. Astronomy & Astrophysics Review (1998)