Learning-based EM clustering for data on the unit hypersphere with application to exoplanet data
Introduction
In 1918, von Mises [1] first proposed a distribution for circular data, and Watson and Williams [2] later studied inference problems for the von Mises distribution by constructing statistical methods for circular data. Studies of (2-dimensional) circular data have since been extended to (3-dimensional) spherical data and to (high-dimensional) data on the unit hypersphere, with wide applications in geology, biology, meteorology, medicine and oceanography [3], [4], [5], [6], [7], [8], [9].
Clustering involves finding clusters in a data set such that similarity is greatest within a cluster and dissimilarity is greatest between clusters. It is a useful tool for data analysis and a branch of statistical multivariate analysis and of unsupervised learning for pattern recognition [10], [11]. From a statistical point of view, clustering methods can be divided into two categories: probability model-based approaches [12], [13], [14] and nonparametric approaches [15], [16]. Probability model-based approaches assume that the data set follows a mixture of probability distributions, and EM (expectation-maximization) algorithms [17], which estimate the distribution parameters, can then be used for clustering. Nonparametric approaches generally require an objective function of similarity or dissimilarity, and among them partition clustering methods are the most popular. Frequently used partition methods are k-means [18], [19], [20], fuzzy c-means (FCM) [21], [22], [23], mean shift [24], [25], [26] and possibilistic c-means [27], [28].
EM is a commonly used algorithm for clustering grouped data with a mixture model, and it can also be applied to directional data. Banerjee et al. [6] proposed an EM algorithm using a mixture of von Mises-Fisher distributions, called soft-moVMF. However, soft-moVMF is sensitive to initial values: Banerjee et al. [6] proposed no appropriate initialization technique, and inappropriate initializations usually lead to bad clustering results. Moreover, soft-moVMF does not work when the number of clusters is unknown; the number c of clusters must be assigned a priori, whereas in general c is unknown. In that case, validity functions such as the Akaike information criterion (AIC) [29], the Bayesian information criterion (BIC) [30], [31], the gap statistic [32], [33] and fuzzy cluster validity [21], [34] can be used to find the number of clusters. However, these are extra indices that are not embedded into the iterations of the algorithms. There are many other methods for determining the number of clusters, such as Peck et al. [35], Li et al. [36] and Josse and Husson [37]. Although these algorithms can find a number of clusters, they depend on initialization and parameter selection. Recently, Yang et al. [38] proposed a robust EM algorithm for Gaussian mixture models that is robust to initialization and parameter selection and automatically produces a suitable number of clusters. Since Yang et al. [38] treated data in Euclidean space with Gaussian mixture models, this study extends their method to clustering data on the unit hypersphere using mixtures of von Mises-Fisher distributions, termed a learning-based EM algorithm for data on the unit hypersphere. The proposed algorithm for this type of hyper-spherical data is free of initialization and automatically determines the number of clusters.
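To make the soft-moVMF baseline concrete, the following Python sketch (our illustration, not the authors' code) implements the standard EM iteration for a mixture of von Mises-Fisher densities, using the closed-form concentration approximation of Banerjee et al. [6]. The farthest-point initialization and all function names are our own illustrative choices:

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function I_v

def vmf_log_density(X, mu, kappa):
    """Row-wise log-density of a von Mises-Fisher distribution on the unit (d-1)-sphere."""
    d = X.shape[1]
    # log C_d(kappa); using log I_v(kappa) = log ive(v, kappa) + kappa avoids overflow
    log_c = ((d / 2 - 1) * np.log(kappa)
             - (d / 2) * np.log(2 * np.pi)
             - (np.log(ive(d / 2 - 1, kappa)) + kappa))
    return log_c + kappa * (X @ mu)

def soft_movmf(X, c, n_iter=100, seed=0):
    """EM for a c-component mixture of von Mises-Fisher distributions (soft-moVMF sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # farthest-point initialization of the mean directions (our choice, not from [6])
    mus = [X[rng.integers(n)]]
    for _ in range(c - 1):
        sims = np.max(np.stack([X @ m for m in mus]), axis=0)
        mus.append(X[np.argmin(sims)])
    mus = np.array(mus)
    kappas = np.full(c, 1.0)
    alphas = np.full(c, 1.0 / c)
    for _ in range(n_iter):
        # E-step: posterior responsibilities gamma_ij, computed in the log domain
        log_p = np.stack([np.log(alphas[i]) + vmf_log_density(X, mus[i], kappas[i])
                          for i in range(c)], axis=1)
        gamma = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: mixing proportions, mean directions, concentrations
        alphas = gamma.mean(axis=0)
        for i in range(c):
            r = gamma[:, i] @ X                      # weighted resultant vector
            mus[i] = r / np.linalg.norm(r)
            r_bar = min(np.linalg.norm(r) / gamma[:, i].sum(), 0.9999)  # guard r_bar -> 1
            # closed-form concentration approximation from Banerjee et al. [6]
            kappas[i] = (r_bar * d - r_bar ** 3) / (1 - r_bar ** 2)
    return alphas, mus, kappas, gamma
```

Note that c is fixed throughout the iteration, which is exactly the limitation the learning-based extension removes: a poor choice of c or of the initial mean directions cannot be repaired by the EM updates themselves.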
The remainder of this paper is organized as follows. Section 2 presents the proposed learning-based EM algorithm for data on the unit hypersphere. In Section 3, some experimental examples are used to compare the proposed learning-based EM algorithm with soft-moVMF and soft-moVMF + BIC. The results demonstrate the superiority and utility of the proposed method. Section 4 gives a real application of the learning-based EM algorithm to exoplanet data. Finally, conclusions are stated in Section 5.
Learning-based EM algorithm for data on the unit hypersphere
Suppose that the data set is a random sample from a mixture probability model on the d-variate unit hypersphere. The density of the mixture model is $f(x;\theta)=\sum_{i=1}^{c}\alpha_i f_i(x;\theta_i)$, where $f_i$ and $\alpha_i$ are the probability density function and the mixing proportion of the $i$th subpopulation, respectively. If $z_1, z_2, \ldots, z_c$ are the indicator functions such that $z_{ij} = z_i(x_j) = 1$ if $x_j$ arises from subpopulation $i$, and $z_{ij} = 0$ if $x_j$ arises from another subpopulation, for $i = 1, 2, \ldots, c$
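The snippet is cut off at this point; for completeness, the complete-data log-likelihood that the EM iteration alternately maximizes under this indicator formulation is the standard one (symbols as defined above):

```latex
L_c(\alpha, \theta \mid z)
  = \sum_{j=1}^{n} \sum_{i=1}^{c} z_{ij}
    \bigl[ \ln \alpha_i + \ln f_i(x_j; \theta_i) \bigr]
```

In the E-step each unobserved $z_{ij}$ is replaced by its posterior expectation given the current parameter estimates, and the M-step maximizes the resulting expected log-likelihood over $\alpha$ and $\theta$.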
Numerical comparisons and experimental results
In this section, the proposed learning-based EM algorithm is compared with existing model-based clustering methods. Banerjee et al. [6] established an EM algorithm using a mixture of von Mises-Fisher distributions, called soft-moVMF. However, for the soft-moVMF algorithm, the cluster number must be given a priori, and initial values can also influence the clustering results. Since the Bayesian information criterion (BIC, see Schwarz [30]) can be used to determine the number of clusters, soft-moVMF
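The soft-moVMF + BIC combination can be made concrete with a small helper. This Python sketch (our illustration, not the paper's code) scores a fitted c-component von Mises-Fisher mixture on the (d−1)-sphere; the parameter count used here — (c−1) mixing proportions, c(d−1) free mean-direction coordinates, and c concentrations — and the function name are our own assumptions:

```python
import numpy as np

def vmf_mixture_bic(log_likelihood, c, d, n):
    """BIC = -2 log L + k log n for a c-component vMF mixture on the (d-1)-sphere."""
    k = (c - 1) + c * (d - 1) + c   # free parameters of the mixture
    return -2.0 * log_likelihood + k * np.log(n)
```

One would fit soft-moVMF for each candidate number of clusters and keep the c that minimizes this criterion, which is the "soft-moVMF + BIC" baseline used in the comparisons; note that the criterion is evaluated after each fit rather than inside the EM iterations.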
Application to extrasolar planets
The detection of extrasolar planets (exoplanets) in recent years is one of the greatest discoveries in the history of astronomy. For researchers in celestial mechanics, these indirectly detected objects show unexpected dynamical features and present a new field of study. A simple and comprehensive taxonomy is necessary for the large number of exoplanets. There has already been some labeling of exoplanets: close-in planets, for example, have been identified and labeled, as have giant planets, Jovian
Conclusions
In this paper we propose a learning-based EM algorithm for data on the unit hypersphere. The proposed algorithm estimates the parameters of von Mises-Fisher distributions without initialization and automatically obtains a suitable number of clusters. It is a fully unsupervised model-based clustering method for data on the unit hypersphere. In comparisons with existing model-based clustering methods, the results on numerical and real data sets demonstrate the effectiveness and superiority
Acknowledgements
The authors would like to thank the anonymous referees for their helpful comments in improving the presentation of this paper. This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 105-2118-M-033-004-MY2.
References (56)

- et al., An unsupervised clustering algorithm for data on the unit hypersphere, Appl. Soft Comput. (2016)
- et al., Bias-correction fuzzy clustering algorithms, Inf. Sci. (2015)
- et al., Robust-learning fuzzy c-means clustering algorithm with unknown number of clusters, Pattern Recogn. (2017)
- et al., Mean shift-based clustering, Pattern Recogn. (2007)
- et al., Robust cluster validity indexes, Pattern Recogn. (2009)
- et al., Selecting the number of components in principal component analysis using cross-validation approximations, Comput. Stat. Data Anal. (2012)
- et al., A robust EM clustering algorithm for Gaussian mixture models, Pattern Recogn. (2012)
- et al., Clustering by competitive agglomeration, Pattern Recogn. (1997)
- Über die Ganzzahligkeit der Atomgewichte und verwandte Fragen, Phys. Z. (1918)
- et al., On the construction of significance tests on the circle and the sphere, Biometrika (1956)
- Circular Statistics in Biology
- Statistical Analysis of Spherical Data
- Directional Statistics
- Clustering on the unit hypersphere using von Mises-Fisher distributions, J. Mach. Learn. Res.
- Model-based clustering on the unit sphere with an illustration using gene expression profiles, Biostatistics
- Minimum message length estimation of mixtures of multivariate Gaussian and von Mises-Fisher distributions, Mach. Learn.
- Pattern Classification and Scene Analysis
- Finding Groups in Data: An Introduction to Cluster Analysis
- Model-based clustering, discriminant analysis, and density estimation, J. Am. Stat. Assoc.
- Finite mixture models and model-based clustering, Stat. Surv.
- Fitting mixtures of Kent distributions to aid in joint set identifications, J. Am. Stat. Assoc.
- Bootstrapping for significance of compact clusters in multidimensional datasets, J. Am. Stat. Assoc.
- A k-mean-directions algorithm for fast clustering of data on the sphere, J. Comput. Graph. Stat.
- Maximum likelihood from incomplete data via the EM algorithm (with discussion), J. R. Stat. Soc. B
- A k-means clustering algorithm, Appl. Stat.
- Some methods for classification and analysis of multivariate observations
- Quantization and the method of k-means, IEEE Trans. Inf. Theory
- Pattern Recognition with Fuzzy Objective Function Algorithms
Cited by (4)

- Directional statistics-based quality measure for spotlight color images, 2020, Signal, Image and Video Processing
- A Learning Based EM Clustering for Circular Data with Unknown Number of Clusters, 2020, Proceedings of Engineering and Technology Innovation
- Possiblistic C-Means Clustering on Directional Data, 2019, Proceedings - 2019 12th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics, CISP-BMEI 2019
- The identification of outliers in wrapped normal data by using ga statistics, 2019, International Journal of Innovative Technology and Exploring Engineering