Applied Soft Computing

Volume 60, November 2017, Pages 101-114

Learning-based EM clustering for data on the unit hypersphere with application to exoplanet data

https://doi.org/10.1016/j.asoc.2017.06.037

Highlights

  • We construct a learning-based EM clustering algorithm for hyper-spherical data.

  • The proposed method is robust to outliers, requires no initialization, and does not need the number of clusters to be assigned a priori.

  • Numerical and real examples with comparisons are given to demonstrate the effectiveness and superiority of the proposed method.

  • The proposed method is applied to cluster exoplanet data.

Abstract

This study focuses on clustering algorithms for data on the unit hypersphere. This type of directional data, lying on the surface of a unit hypersphere, is used in geology, biology, meteorology, medicine and oceanography. The EM algorithm with mixtures of von Mises-Fisher distributions is often used for model-based clustering of data on the unit hypersphere. However, the EM algorithm is sensitive to initial values and outliers, and the number of clusters must be assigned a priori. In this paper, we propose an effective approach, called a learning-based EM algorithm with von Mises-Fisher distributions, to cluster this type of hyper-spherical data. The proposed clustering method is robust to outliers, requires no initialization, and automatically determines the number of clusters. Thus, it becomes a fully unsupervised model-based clustering method for data on the unit hypersphere. Numerical and real examples with comparisons are given to demonstrate the effectiveness and superiority of the proposed method. We also apply the proposed learning-based EM algorithm to cluster exoplanet data. The clustering results have several important implications for exoplanet data and allow an interpretation of exoplanet migration.

Introduction

In 1918, von Mises [1] first proposed a distribution for circular data, and Watson and Williams [2] later studied inference problems for the von Mises distribution by constructing statistical methods for circular data. Studies of (2-dimensional) circular data have since been extended to (3-dimensional) spherical data and (high-dimensional) data on the unit hypersphere, which are widely applied in geology, biology, meteorology, medicine and oceanography [3], [4], [5], [6], [7], [8], [9].

Clustering involves finding clusters within a data set that are characterized by the greatest similarity within the same cluster and the greatest dissimilarity between different clusters. It is a useful tool for data analysis and a branch of statistical multivariate analysis and unsupervised learning for pattern recognition [10], [11]. From the statistical point of view, clustering methods can be divided into two categories: probability model-based approaches [12], [13], [14] and nonparametric approaches [15], [16]. Probability model-based approaches assume that the data set follows a mixture of probability distributions, and EM (expectation-maximization) algorithms [17], which estimate the distribution parameters, can then be used to cluster the data set. Nonparametric approaches generally require an objective function of similarity or dissimilarity, among which partition clustering methods are the most popular. Frequently used partition methods are k-means [18], [19], [20], fuzzy c-means (FCM) [21], [22], [23], mean shift [24], [25], [26] and possibilistic c-means [27], [28].
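The partition-style alternative mentioned above can be sketched for data already normalized to the unit hypersphere as a cosine-similarity variant of k-means. This is a minimal illustration of the nonparametric family, not the paper's method; the farthest-point seeding is an assumption made here for determinism:

```python
import numpy as np

def spherical_kmeans(X, c, n_iter=100):
    """Partition unit vectors into c clusters by cosine similarity.

    X is an (n, d) array whose rows have unit norm. Initialization is
    farthest-point seeding (an assumption of this sketch).
    """
    centers = [X[0]]
    for _ in range(1, c):
        sims = np.max(np.stack([X @ m for m in centers]), axis=0)
        centers.append(X[np.argmin(sims)])         # most dissimilar point so far
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmax(X @ centers.T, axis=1)  # assign by cosine similarity
        for i in range(c):
            m = X[labels == i].sum(axis=0)
            norm = np.linalg.norm(m)
            if norm > 0:
                centers[i] = m / norm              # renormalized mean direction
    return labels, centers
```

Unlike the model-based methods discussed next, this variant fixes c in advance and provides no notion of concentration or mixing proportion.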

EM is a commonly used algorithm for clustering grouped data using a mixture model, and it can also be applied to directional data. Banerjee et al. [6] proposed an EM algorithm using a mixture of von Mises-Fisher distributions, called soft-moVMF. However, soft-moVMF [6] is sensitive to initial values. Banerjee et al. [6] proposed no appropriate initialization technique, and inappropriate initializations usually lead to bad clustering results for soft-moVMF. Moreover, soft-moVMF [6] does not work when the number of clusters is unknown: it requires a number of clusters, c, to be assigned a priori, whereas c is generally unknown. In this case, some validity functions, such as the Akaike information criterion (AIC) [29], the Bayesian information criterion (BIC) [30], [31], the gap statistic [32], [33] and fuzzy cluster validity [21], [34], can be used to find the number of clusters. However, these are extra indices that are not embedded in the iterations of the algorithms. There are many methods for determining the number of clusters, such as those of Peck et al. [35], Li et al. [36], and Josse and Husson [37]. Although these algorithms can find a number of clusters, they depend on initialization and parameter selection. Recently, Yang et al. [38] proposed a robust EM algorithm for Gaussian mixture models that is robust to initialization and parameter selection and automatically produces a suitable number of clusters. Because Yang et al. [38] designed their EM for Gaussian mixture models on data in Euclidean space, this study extends the method to clustering data on the unit hypersphere using mixtures of von Mises-Fisher distributions. This is termed a learning-based EM algorithm for data on the unit hypersphere. The proposed algorithm for this type of hyper-spherical data is free of initialization and automatically determines the number of clusters.
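The soft-moVMF scheme of Banerjee et al. [6] described above can be sketched as follows for 3-dimensional (spherical) data, where the vMF normalizer has the closed form κ/(4π sinh κ); for general d it involves the modified Bessel function I_{d/2−1}. The farthest-point seeding and the concentration update (the standard approximation from [6]) are illustrative choices, not the authors' implementation:

```python
import numpy as np

def log_vmf3(X, mu, kappa):
    """Log-density of the von Mises-Fisher distribution on the 2-sphere (d = 3)."""
    # log sinh(kappa) computed stably for large kappa
    log_sinh = kappa + np.log1p(-np.exp(-2.0 * kappa)) - np.log(2.0)
    log_c = np.log(kappa) - np.log(4.0 * np.pi) - log_sinh
    return log_c + kappa * (X @ mu)

def em_movmf(X, c, n_iter=50):
    """EM for a c-component vMF mixture on the 2-sphere (soft-moVMF-style sketch)."""
    n, d = X.shape
    # farthest-point seeding for the mean directions (assumption of this sketch)
    mu = [X[0]]
    for _ in range(1, c):
        sims = np.max(np.stack([X @ m for m in mu]), axis=0)
        mu.append(X[np.argmin(sims)])
    mu = np.array(mu)
    kappa = np.full(c, 10.0)
    alpha = np.full(c, 1.0 / c)
    for _ in range(n_iter):
        # E-step: posterior responsibilities gamma[j, i], stabilized in log space
        log_p = np.stack([np.log(alpha[i]) + log_vmf3(X, mu[i], kappa[i])
                          for i in range(c)], axis=1)
        log_p -= log_p.max(axis=1, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: mean directions, concentrations, mixing proportions
        for i in range(c):
            r = gamma[:, i] @ X
            rn = np.linalg.norm(r)
            mu[i] = r / rn
            rbar = rn / gamma[:, i].sum()
            kappa[i] = rbar * (d - rbar**2) / (1.0 - rbar**2)  # approximation of [6]
        alpha = gamma.mean(axis=0)
    return gamma.argmax(axis=1), mu, kappa, alpha
```

As the text notes, this baseline needs c fixed in advance and its result depends on the seeding; the proposed learning-based EM removes both requirements.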

The remainder of this paper is organized as follows. Section 2 presents the proposed learning-based EM algorithm for data on the unit hypersphere. In Section 3, some experimental examples are used to compare the proposed learning-based EM algorithm with soft-moVMF and soft-moVMF + BIC. The results demonstrate the superiority and utility of the proposed method. Section 4 gives a real application of the learning-based EM algorithm to exoplanet data. Finally, conclusions are stated in Section 5.

Section snippets

Learning-based EM algorithm for data on the unit hypersphere

Suppose that the data set X = {x_j : ||x_j|| = 1, j = 1, 2, ..., n} is a random sample from a mixture probability model on the d-variate unit hypersphere. The density of the mixture model is f(x; θ) = ∑_{i=1}^{c} α_i f_i(x; θ_i), where f_i and α_i are the probability density function and the mixing proportion for the ith subpopulation, respectively. If z_1, z_2, ..., z_c are the indicator functions such that z_ij = z_i(x_j) = 1 if x_j arises from subpopulation i, and z_ij = 0 if x_j arises from another subpopulation, for i = 1, 2, ..., c
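In the notation above, the mixture density with vMF components and the complete-data log-likelihood that the EM iterations work with can be written as follows (a standard reconstruction from the definitions in this snippet; the learning terms specific to the proposed algorithm are not shown here):

```latex
f(\mathbf{x};\theta)=\sum_{i=1}^{c}\alpha_i f_i(\mathbf{x};\theta_i),
\qquad
f_i(\mathbf{x};\mu_i,\kappa_i)=c_d(\kappa_i)\exp\!\left(\kappa_i\,\mu_i^{\top}\mathbf{x}\right),
\qquad
c_d(\kappa)=\frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\,I_{d/2-1}(\kappa)},

\log L_c(\theta;X,Z)=\sum_{j=1}^{n}\sum_{i=1}^{c} z_{ij}
\left[\log\alpha_i+\log f_i(\mathbf{x}_j;\theta_i)\right],
```

where I_{d/2−1} is the modified Bessel function of the first kind and the z_ij are the indicator variables defined above.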

Numerical comparisons and experimental results

In this section, the proposed learning-based EM algorithm is compared with existing model-based clustering methods. Banerjee et al. [6] established an EM algorithm using a mixture of von Mises-Fisher distributions, called soft-moVMF. However, for the soft-moVMF algorithm, the cluster number must be given a priori, and initial values can also influence the clustering results. Since the Bayesian information criterion (BIC, see Schwarz [30]) can be used to determine the number of clusters, soft-moVMF
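The soft-moVMF + BIC baseline used for comparison can be sketched as below. BIC is taken in Schwarz's form k·ln n − 2·ln L̂ (lower is better); the parameter count c·d + c − 1 is an assumption of this sketch, counting (d−1) free mean-direction components and one concentration per component plus c − 1 free mixing proportions, and the log-likelihood values are hypothetical:

```python
import numpy as np

def vmf_mixture_bic(log_lik, c, d, n):
    """Schwarz BIC for a c-component vMF mixture on the (d-1)-sphere.

    Free parameters: c*(d-1) mean directions + c concentrations
    + (c-1) mixing proportions = c*d + c - 1. Lower BIC is better.
    """
    k = c * d + c - 1
    return k * np.log(n) - 2.0 * log_lik

# Hypothetical maximized log-likelihoods from separate soft-moVMF fits at c = 1..4
log_liks = {1: -980.0, 2: -700.0, 3: -690.0, 4: -688.0}
n, d = 500, 3
scores = {c: vmf_mixture_bic(ll, c, d, n) for c, ll in log_liks.items()}
best_c = min(scores, key=scores.get)  # the c with the lowest BIC is selected
```

Note that each candidate c requires a full soft-moVMF run, whereas the proposed learning-based EM adjusts the number of clusters within its own iterations.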

Application to extrasolar planets

The detection of extrasolar planets (exoplanets) in recent years is one of the greatest discoveries in the history of astronomy. For researchers in celestial mechanics, these indirectly detected objects show unexpected dynamical features and present a new field of study. A simple and comprehensive taxonomy is necessary for the large number of exoplanets. There has been some labeling of exoplanets. Close-in planets, for example, have been identified and labeled, as have giant planets, Jovian

Conclusions

In this paper, we propose a learning-based EM algorithm for data on the unit hypersphere. The proposed algorithm estimates the parameters of von Mises-Fisher distributions without initialization and automatically obtains a suitable number of clusters. It is a fully unsupervised model-based clustering method for data on the unit hypersphere. In comparisons with existing model-based clustering methods, the results on numerical and real data sets demonstrate the effectiveness and superiority

Acknowledgements

The authors would like to thank the anonymous referees for their helpful comments in improving the presentation of this paper. This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 105-2118-M-033-004-MY2.

References (56)

  • E. Batschelet, Circular Statistics in Biology (1981)

  • N.I. Fisher et al., Statistical Analysis of Spherical Data (1987)

  • K.V. Mardia et al., Directional Statistics (2000)

  • A. Banerjee et al., Clustering on the unit hypersphere using von Mises-Fisher distributions, J. Mach. Learn. Res. (2005)

  • J.-L. Dortet-Bernadet et al., Model-based clustering on the unit sphere with an illustration using gene expression profiles, Biostatistics (2008)

  • P. Kasarapu et al., Minimum message length estimation of mixtures of multivariate Gaussian and von Mises-Fisher distributions, Mach. Learn. (2015)

  • R.O. Duda et al., Pattern Classification and Scene Analysis (1973)

  • L. Kaufman et al., Finding Groups in Data: An Introduction to Cluster Analysis (1990)

  • C. Fraley et al., Model-based clustering, discriminant analysis, and density estimation, J. Am. Stat. Assoc. (2002)

  • V. Melnykov et al., Finite mixture models and model-based clustering, Stat. Surv. (2010)

  • D. Peel et al., Fitting mixtures of Kent distributions to aid in joint set identifications, J. Am. Stat. Assoc. (2001)

  • R. Maitra et al., Bootstrapping for significance of compact clusters in multidimensional datasets, J. Am. Stat. Assoc. (2012)

  • R. Maitra et al., A k-mean-directions algorithm for fast clustering of data on the sphere, J. Comput. Graph. Stat. (2010)

  • A.P. Dempster et al., Maximum likelihood from incomplete data via the EM algorithm (with discussion), J. R. Stat. Soc. B (1977)

  • J.A. Hartigan et al., A k-means clustering algorithm, Appl. Stat. (1979)

  • J. MacQueen, Some methods for classification and analysis of multivariate observations (1967)

  • D. Pollard, Quantization and the method of k-means, IEEE Trans. Inf. Theory (1982)

  • J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms (1981)