Applied Soft Computing

Volume 60, November 2017, Pages 101-114

Learning-based EM clustering for data on the unit hypersphere with application to exoplanet data

https://doi.org/10.1016/j.asoc.2017.06.037

Highlights

  • We construct a learning-based EM clustering algorithm for hyper-spherical data.

  • The proposed method is robust to outliers, requires no initialization, and does not need the number of clusters to be assigned a priori.

  • Numerical and real examples with comparisons are given to demonstrate the effectiveness and superiority of the proposed method.

  • The proposed method is applied to cluster exoplanet data.

Abstract

This study focuses on clustering algorithms for data on the unit hypersphere. This type of directional data, lying on the surface of a unit hypersphere, is used in geology, biology, meteorology, medicine and oceanography. The EM algorithm with mixtures of von Mises-Fisher distributions is often used for model-based clustering of data on the unit hypersphere. However, the EM algorithm is sensitive to initial values and outliers, and the number of clusters must be assigned a priori. In this paper, we propose an effective approach, called a learning-based EM algorithm with von Mises-Fisher distributions, to cluster this type of hyper-spherical data. The proposed clustering method is robust to outliers, requires no initialization, and automatically determines the number of clusters. Thus, it becomes a fully unsupervised model-based clustering method for data on the unit hypersphere. Numerical and real examples with comparisons are given to demonstrate the effectiveness and superiority of the proposed method. We also apply the proposed learning-based EM algorithm to cluster exoplanet data. The clustering results have several important implications for exoplanet data and allow an interpretation of exoplanet migration.

Introduction

In 1918, von Mises [1] first proposed a distribution for circular data, and Watson and Williams [2] later studied inference problems for the von Mises distribution by constructing statistical methods for circular data. Studies of (2-dimensional) circular data have since been extended to (3-dimensional) spherical data and (high-dimensional) data on the unit hypersphere, which are widely applied in geology, biology, meteorology, medicine and oceanography [3], [4], [5], [6], [7], [8], [9].

Clustering involves finding clusters within a data set that are characterized by the greatest similarity within the same cluster and the greatest dissimilarity between different clusters. It is a useful tool for data analysis and a branch of statistical multivariate analysis and unsupervised learning for pattern recognition [10], [11]. From the statistical point of view, clustering methods can be divided into two categories: probability model-based approaches [12], [13], [14] and nonparametric approaches [15], [16]. Probability model-based approaches assume that the data set follows a mixture of probability distributions, and EM (expectation-maximization) algorithms [17], which estimate the distribution parameters, can then be used to cluster the data set. Nonparametric approaches generally require an objective function of similarity or dissimilarity, among which partition clustering methods are the most popular. Frequently used partition methods are k-means [18], [19], [20], fuzzy c-means (FCM) [21], [22], [23], mean shift [24], [25], [26] and possibilistic c-means [27], [28].
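The partition-style alternative mentioned above can be sketched for data already normalized to the unit hypersphere as a cosine-similarity variant of k-means. This is a minimal illustration of the nonparametric family, not the paper's method; the farthest-point seeding is an assumption made here for determinism:

```python
import numpy as np

def spherical_kmeans(X, c, n_iter=100):
    """Partition unit vectors into c clusters by cosine similarity.

    X is an (n, d) array whose rows have unit norm. Initialization is
    farthest-point seeding (an assumption of this sketch).
    """
    centers = [X[0]]
    for _ in range(1, c):
        sims = np.max(np.stack([X @ m for m in centers]), axis=0)
        centers.append(X[np.argmin(sims)])         # most dissimilar point so far
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmax(X @ centers.T, axis=1)  # assign by cosine similarity
        for i in range(c):
            m = X[labels == i].sum(axis=0)
            norm = np.linalg.norm(m)
            if norm > 0:
                centers[i] = m / norm              # renormalized mean direction
    return labels, centers
```

Unlike the model-based methods discussed next, this variant fixes c in advance and provides no notion of concentration or mixing proportion.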

EM is a commonly used algorithm for clustering grouped data using a mixture model, and it can also be applied to directional data. Banerjee et al. [6] proposed an EM algorithm using a mixture of von Mises-Fisher distributions, called soft-moVMF. However, soft-moVMF [6] is sensitive to initial values. Banerjee et al. [6] proposed no appropriate initialization technique, and inappropriate initializations usually lead to bad clustering results for soft-moVMF. Moreover, soft-moVMF [6] does not work when the number of clusters is unknown: it requires a number of clusters, c, to be assigned a priori, whereas c is generally unknown. In this case, some validity functions, such as the Akaike information criterion (AIC) [29], the Bayesian information criterion (BIC) [30], [31], the gap statistic [32], [33] and fuzzy cluster validity [21], [34], can be used to find the number of clusters. However, these are extra indices that are not embedded in the iterations of the algorithms. There are many methods for determining the number of clusters, such as those of Peck et al. [35], Li et al. [36], and Josse and Husson [37]. Although these algorithms can find a number of clusters, they depend on initialization and parameter selection. Recently, Yang et al. [38] proposed a robust EM algorithm for Gaussian mixture models that is robust to initialization and parameter selection and automatically produces a suitable number of clusters. Because Yang et al. [38] designed their EM for Gaussian mixture models on data in Euclidean space, this study extends the method to clustering data on the unit hypersphere using mixtures of von Mises-Fisher distributions. This is termed a learning-based EM algorithm for data on the unit hypersphere. The proposed algorithm for this type of hyper-spherical data is free of initialization and automatically determines the number of clusters.
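The soft-moVMF scheme of Banerjee et al. [6] described above can be sketched as follows for 3-dimensional (spherical) data, where the vMF normalizer has the closed form κ/(4π sinh κ); for general d it involves the modified Bessel function I_{d/2−1}. The farthest-point seeding and the concentration update (the standard approximation from [6]) are illustrative choices, not the authors' implementation:

```python
import numpy as np

def log_vmf3(X, mu, kappa):
    """Log-density of the von Mises-Fisher distribution on the 2-sphere (d = 3)."""
    # log sinh(kappa) computed stably for large kappa
    log_sinh = kappa + np.log1p(-np.exp(-2.0 * kappa)) - np.log(2.0)
    log_c = np.log(kappa) - np.log(4.0 * np.pi) - log_sinh
    return log_c + kappa * (X @ mu)

def em_movmf(X, c, n_iter=50):
    """EM for a c-component vMF mixture on the 2-sphere (soft-moVMF-style sketch)."""
    n, d = X.shape
    # farthest-point seeding for the mean directions (assumption of this sketch)
    mu = [X[0]]
    for _ in range(1, c):
        sims = np.max(np.stack([X @ m for m in mu]), axis=0)
        mu.append(X[np.argmin(sims)])
    mu = np.array(mu)
    kappa = np.full(c, 10.0)
    alpha = np.full(c, 1.0 / c)
    for _ in range(n_iter):
        # E-step: posterior responsibilities gamma[j, i], stabilized in log space
        log_p = np.stack([np.log(alpha[i]) + log_vmf3(X, mu[i], kappa[i])
                          for i in range(c)], axis=1)
        log_p -= log_p.max(axis=1, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: mean directions, concentrations, mixing proportions
        for i in range(c):
            r = gamma[:, i] @ X
            rn = np.linalg.norm(r)
            mu[i] = r / rn
            rbar = rn / gamma[:, i].sum()
            kappa[i] = rbar * (d - rbar**2) / (1.0 - rbar**2)  # approximation of [6]
        alpha = gamma.mean(axis=0)
    return gamma.argmax(axis=1), mu, kappa, alpha
```

As the text notes, this baseline needs c fixed in advance and its result depends on the seeding; the proposed learning-based EM removes both requirements.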

The remainder of this paper is organized as follows. Section 2 presents the proposed learning-based EM algorithm for data on the unit hypersphere. In Section 3, some experimental examples are used to compare the proposed learning-based EM algorithm with soft-moVMF and soft-moVMF + BIC. The results demonstrate the superiority and utility of the proposed method. Section 4 gives a real application of the learning-based EM algorithm to exoplanet data. Finally, conclusions are stated in Section 5.

Section snippets

Learning-based EM algorithm for data on the unit hypersphere

Suppose that the data set X = {x_j : ||x_j|| = 1, j = 1, 2, ..., n} is a random sample from a mixture probability model on the d-variate unit hypersphere. The density of the mixture model is f(x; θ) = ∑_{i=1}^{c} α_i f_i(x; θ_i), where f_i and α_i are the probability density function and the mixing proportion for the ith subpopulation, respectively. If z_1, z_2, ..., z_c are the indicator functions such that z_ij = z_i(x_j) = 1 if x_j arises from subpopulation i, and z_ij = 0 if x_j arises from another subpopulation, for i = 1, 2, ..., c
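In the notation above, the mixture density with vMF components and the complete-data log-likelihood that the EM iterations work with can be written as follows (a standard reconstruction from the definitions in this snippet; the learning terms specific to the proposed algorithm are not shown here):

```latex
f(\mathbf{x};\theta)=\sum_{i=1}^{c}\alpha_i f_i(\mathbf{x};\theta_i),
\qquad
f_i(\mathbf{x};\mu_i,\kappa_i)=c_d(\kappa_i)\exp\!\left(\kappa_i\,\mu_i^{\top}\mathbf{x}\right),
\qquad
c_d(\kappa)=\frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\,I_{d/2-1}(\kappa)},

\log L_c(\theta;X,Z)=\sum_{j=1}^{n}\sum_{i=1}^{c} z_{ij}
\left[\log\alpha_i+\log f_i(\mathbf{x}_j;\theta_i)\right],
```

where I_{d/2−1} is the modified Bessel function of the first kind and the z_ij are the indicator variables defined above.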

Numerical comparisons and experimental results

In this section, the proposed learning-based EM algorithm is compared with existing model-based clustering methods. Banerjee et al. [6] established an EM algorithm using a mixture of von Mises-Fisher distributions, called soft-moVMF. However, for the soft-moVMF algorithm, the cluster number must be given a priori, and initial values can also influence the clustering results. Since the Bayesian information criterion (BIC, see Schwarz [30]) can be used to determine the number of clusters, soft-moVMF
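The soft-moVMF + BIC baseline used for comparison can be sketched as below. BIC is taken in Schwarz's form k·ln n − 2·ln L̂ (lower is better); the parameter count c·d + c − 1 is an assumption of this sketch, counting (d−1) free mean-direction components and one concentration per component plus c − 1 free mixing proportions, and the log-likelihood values are hypothetical:

```python
import numpy as np

def vmf_mixture_bic(log_lik, c, d, n):
    """Schwarz BIC for a c-component vMF mixture on the (d-1)-sphere.

    Free parameters: c*(d-1) mean directions + c concentrations
    + (c-1) mixing proportions = c*d + c - 1. Lower BIC is better.
    """
    k = c * d + c - 1
    return k * np.log(n) - 2.0 * log_lik

# Hypothetical maximized log-likelihoods from separate soft-moVMF fits at c = 1..4
log_liks = {1: -980.0, 2: -700.0, 3: -690.0, 4: -688.0}
n, d = 500, 3
scores = {c: vmf_mixture_bic(ll, c, d, n) for c, ll in log_liks.items()}
best_c = min(scores, key=scores.get)  # the c with the lowest BIC is selected
```

Note that each candidate c requires a full soft-moVMF run, whereas the proposed learning-based EM adjusts the number of clusters within its own iterations.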

Application to extrasolar planets

The detection of extrasolar planets (exoplanets) in recent years is one of the greatest discoveries in the history of astronomy. For researchers in celestial mechanics, these indirectly detected objects show unexpected dynamical features and present a new field of study. A simple and comprehensive taxonomy is necessary for the large number of exoplanets. There has been some labeling of exoplanets. Close-in planets, for example, have been identified and labeled, as have giant planets, Jovian

Conclusions

In this paper, we propose a learning-based EM algorithm for data on the unit hypersphere. The proposed algorithm estimates the parameters of von Mises-Fisher distributions without initialization and automatically obtains a suitable number of clusters. It is a fully unsupervised model-based clustering method for data on the unit hypersphere. In comparisons with existing model-based clustering methods, the results on numerical and real data sets demonstrate the effectiveness and superiority

Acknowledgements

The authors would like to thank the anonymous referees for their helpful comments in improving the presentation of this paper. This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 105-2118-M-033-004-MY2.

References (56)

  • E. Batschelet, Circular Statistics in Biology (1981)

  • N.I. Fisher et al., Statistical Analysis of Spherical Data (1987)

  • K.V. Mardia et al., Directional Statistics (2000)

  • A. Banerjee et al., Clustering on the unit hypersphere using von Mises-Fisher distributions, J. Mach. Learn. Res. (2005)

  • J.-L. Dortet-Bernadet et al., Model-based clustering on the unit sphere with an illustration using gene expression profiles, Biostatistics (2008)

  • P. Kasarapu et al., Minimum message length estimation of mixtures of multivariate Gaussian and von Mises-Fisher distributions, Mach. Learn. (2015)

  • R.O. Duda et al., Pattern Classification and Scene Analysis (1973)

  • L. Kaufman et al., Finding Groups in Data: An Introduction to Cluster Analysis (1990)

  • C. Fraley et al., Model-based clustering, discriminant analysis, and density estimation, J. Am. Stat. Assoc. (2002)

  • V. Melnykov et al., Finite mixture models and model-based clustering, Stat. Surv. (2010)

  • D. Peel et al., Fitting mixtures of Kent distributions to aid in joint set identifications, J. Am. Stat. Assoc. (2001)

  • R. Maitra et al., Bootstrapping for significance of compact clusters in multidimensional datasets, J. Am. Stat. Assoc. (2012)

  • R. Maitra et al., A k-mean-directions algorithm for fast clustering of data on the sphere, J. Comput. Graph. Stat. (2010)

  • A.P. Dempster et al., Maximum likelihood from incomplete data via the EM algorithm (with discussion), J. R. Stat. Soc. B (1977)

  • J.A. Hartigan et al., A k-means clustering algorithm, Appl. Stat. (1979)

  • J. MacQueen, Some methods for classification and analysis of multivariate observations (1967)

  • D. Pollard, Quantization and the method of k-means, IEEE Trans. Inf. Theory (1982)

  • J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms (1981)