Unsupervised fuzzy model-based Gaussian clustering
Introduction
Clustering is an important tool in data science. It aims to find cluster structure in a data set such that similarity is greatest within the same cluster and dissimilarity is greatest between different clusters [1], [18], [40]. Cluster analysis has become a branch of statistical multivariate analysis and has been widely applied in many areas, including numerical taxonomy, image processing, pattern recognition, medicine, gene expression data, economics, ecology, marketing, and artificial intelligence [1], [16], [23]. From the statistical point of view, clustering methods are generally divided into two categories: (probability) model-based approaches and nonparametric approaches. A model-based approach assumes that the data points come from a mixture of probability distributions, so a mixture likelihood approach to clustering is used [23], [37]. Nonparametric approaches mainly involve an objective function built on similarity or dissimilarity measures [13], [16]. Among nonparametric approaches, prototype-based clustering algorithms, such as k-means, fuzzy c-means, and possibilistic c-means, are the most widely used [1], [7], [18], [20]. Another nonparametric approach is spectral clustering [26], [36], which is equivalent to nonnegative matrix factorization under certain conditions [34]; applications of this method include constructing new data representations [32] and feature selection [24], [25], [33].
In the literature, Banfield and Raftery [6] first proposed model-based Gaussian (MB-Gauss) clustering. An eigenvalue decomposition of the covariance matrix lets them specify which features (orientation, size, and shape) are common to all clusters and which differ between clusters. These MB-Gauss clustering methods have been widely studied and applied in various areas [5], [6], [48]. For example, Wehrens et al. [37] applied it to image segmentation; Yeung et al. [44] and Young et al. [45] used it for gene expression data; and Akilan et al. [2] utilized it for background subtraction. Since Zadeh [46] proposed fuzzy sets to introduce the idea of partial memberships described by membership functions, fuzzy sets have been successfully applied to clustering, where Ruspini [29] first proposed fuzzy c-partitions as a fuzzy approach to clustering. To our knowledge, no attempt has been made in the literature to connect fuzzy sets with MB-Gauss clustering. This study first uses fuzzy c-partitions to extend MB-Gauss to fuzzy MB-Gauss (F-MB-Gauss) clustering. However, both the MB-Gauss and F-MB-Gauss clustering algorithms remain sensitive to initial values and require the number of clusters to be assigned a priori. In general, the Bayesian information criterion (BIC) is a common way for the MB-Gauss clustering algorithm to determine the number of clusters [6], [9], [45], [48]. To date, no prior literature has shown MB-Gauss to be simultaneously robust to initializations and able to determine the number of clusters, perhaps because of the difficulty of constructing this unsupervised type of clustering. In this paper, we construct an unsupervised learning schema for F-MB-Gauss clustering such that it becomes free of initialization and can simultaneously obtain an optimal number of clusters. The proposed unsupervised learning schema should be important for most model-based clustering methods.
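The eigenvalue decomposition underlying this parameterization is commonly written in the MB-Gauss literature (standard notation, not reproduced from this paper) as:

```latex
\Sigma_i \;=\; \lambda_i\, D_i\, A_i\, D_i^{\top},
```

where the scalar $\lambda_i$ governs the volume (size) of cluster $i$, the orthogonal matrix $D_i$ of eigenvectors its orientation, and the normalized diagonal matrix $A_i$ of eigenvalues its shape; holding one of these factors fixed across clusters makes the corresponding feature common to all clusters.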
Therefore, a novel unsupervised F-MB-Gauss (UF-MB-Gauss) clustering algorithm is proposed in this paper.
In fact, the literature contains other nonparametric clustering methods that can automatically determine the number of clusters: the FCM with focal point (FCMFP) algorithm [14], the robust-learning FCM (RL-FCM) algorithm [42], clustering by fast search (C-FS) [27], the automatic merging possibilistic clustering method (AMPCM) [40], and the adaptive possibilistic clustering algorithm (APCM) [38]. These methods are reviewed in the next section as other related works. Comparisons of the proposed UF-MB-Gauss clustering algorithm with these existing clustering methods are made, and the proposed algorithm is applied to real data sets and image segmentation to demonstrate its usefulness and effectiveness. In summary, the contributions of the paper are as follows:
- The MB-Gauss clustering is extended to fuzzy model-based Gaussian (F-MB-Gauss) clustering.
- An unsupervised learning schema is constructed for F-MB-Gauss clustering such that the proposed unsupervised F-MB-Gauss (UF-MB-Gauss) algorithm becomes free of initialization and parameter selection while simultaneously obtaining an optimal number of clusters.
- Comparisons on real applications demonstrate the superiority and usefulness of the proposed UF-MB-Gauss clustering algorithm.
The remainder of the paper is organized as follows. Section 2 reviews MB-Gauss clustering and other related works, including FCMFP [14], RL-FCM [42], C-FS [27], AMPCM [40], and APCM [38]. In Section 3, MB-Gauss clustering is first extended to fuzzy MB-Gauss (F-MB-Gauss) clustering, an unsupervised learning schema is proposed for F-MB-Gauss clustering, and the unsupervised F-MB-Gauss clustering algorithm is then constructed. Section 4 presents comparisons of the proposed algorithm with existing clustering methods on numerical and real data sets, and also applies it to image segmentation. Finally, conclusions are stated in Section 5.
Model-based Gaussian clustering
Banfield and Raftery [6] proposed model-based Gaussian (MB-Gauss) clustering, a mixture likelihood approach to clustering with Gaussian distributions that follows classification maximum likelihood procedures. Scott and Symons [31] first developed classification maximum likelihood procedures, and Symons [35] then derived some clustering algorithms from them. Banfield and Raftery [6] proposed MB-Gauss clustering by extending these classification maximum likelihood procedures.
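As a concrete illustration, the classification maximum likelihood idea (hard assignments alternated with parameter re-estimation, often called classification EM) can be sketched as follows. This is a minimal sketch assuming spherical covariances and a farthest-point initialization, not the full MB-Gauss covariance parameterization:

```python
import numpy as np

def cem_gauss(X, c, n_iter=50, seed=0):
    """Classification EM (CEM) for Gaussian clusters with hard assignments.

    A minimal sketch of the classification maximum likelihood procedure
    behind MB-Gauss clustering; spherical covariances are assumed here for
    brevity, whereas MB-Gauss parameterizes full covariance matrices.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Farthest-point initialization of the cluster means.
    idx = [int(rng.integers(n))]
    for _ in range(c - 1):
        dist2 = np.min([((X - X[j]) ** 2).sum(1) for j in idx], axis=0)
        idx.append(int(dist2.argmax()))
    mu = X[idx].astype(float).copy()
    var = np.full(c, X.var())
    pi = np.full(c, 1.0 / c)
    for _ in range(n_iter):
        # E/C step: hard-assign each point to the class maximizing its
        # classification log-likelihood (log prior + Gaussian log density).
        logp = np.stack([
            np.log(pi[i]) - 0.5 * d * np.log(2 * np.pi * var[i])
            - 0.5 * ((X - mu[i]) ** 2).sum(1) / var[i]
            for i in range(c)
        ])
        z = logp.argmax(0)
        # M step: re-estimate parameters from the hard partition.
        for i in range(c):
            mask = z == i
            if mask.any():
                mu[i] = X[mask].mean(0)
                var[i] = ((X[mask] - mu[i]) ** 2).mean() + 1e-9
                pi[i] = mask.mean()
    return z, mu
```

Unlike soft EM, each point contributes to exactly one cluster's parameter updates, which is precisely the hard c-partition structure that the fuzzy extension later relaxes.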
Unsupervised fuzzy model-based Gaussian clustering
Recall that model-based Gaussian (MB-Gauss) clustering partitions a data set into a hard c-partition with indicator functions and has the following objective function:
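In its standard classification maximum likelihood form (a reconstruction from the MB-Gauss literature, with $f_i$ denoting the $i$-th multivariate Gaussian density), this objective reads:

```latex
\max_{z,\,\theta}\;\sum_{j=1}^{n}\sum_{i=1}^{c} z_i(x_j)\,\ln f_i\!\left(x_j;\,\mu_i,\Sigma_i\right),
\qquad z_i(x_j)\in\{0,1\},\quad \sum_{i=1}^{c} z_i(x_j)=1 .
```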
The fuzzy set, proposed by Zadeh [46] in 1965, is an extension that allows z_i(x) to be a membership function assuming values in the interval [0,1]. Ruspini [29] introduced the fuzzy c-partition by extending the indicator functions in this way.
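Under Ruspini's fuzzy c-partition, the hard constraint is relaxed to the standard membership constraints (the exact F-MB-Gauss objective follows in Section 3):

```latex
0 \le z_i(x_j) \le 1, \qquad \sum_{i=1}^{c} z_i(x_j)=1 \quad \text{for } j=1,\dots,n .
```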
Comparisons and experimental results
In this section, several numerical and real data sets, including some benchmark data sets with large numbers of instances, attributes, and classes, are used to illustrate the performance of the proposed UF-MB-Gauss clustering algorithm. A comparison between the proposed UF-MB-Gauss and MB-Gauss-II is also made. The Bayesian information criterion (BIC) [17], [30] is a common method for determining the number of clusters in mixture models; hence, it is compared with the proposed UF-MB-Gauss for determining the optimal number of clusters.
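BIC-based selection of the number of clusters, as used for MB-Gauss [6], [9], can be sketched with scikit-learn's generic `GaussianMixture` (not the authors' MB-Gauss implementation; the data and candidate cluster counts here are purely illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: three well-separated Gaussian clusters (illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, (100, 2)) for m in (0.0, 6.0, 12.0)])

# Fit a Gaussian mixture for each candidate number of clusters and record
# its BIC; the candidate minimizing BIC is the model-selection choice.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)
print("BIC-selected number of clusters:", best_k)
```

BIC trades off fit (maximized log-likelihood) against a complexity penalty proportional to the number of free parameters, so over-large candidate counts are penalized even when they fit slightly better.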
Conclusions and discussion
In view of the dependence of model-based Gaussian (MB-Gauss) clustering and its extensions on initializations, and their requirement that the number of clusters be assigned a priori, this study proposed fuzzy model-based Gaussian (F-MB-Gauss) clustering and then constructed an unsupervised learning schema for it. Mixing proportions of clusters were first considered for F-MB-Gauss, and a penalty term was then added using the average information for the occurrence of each data point.
Acknowledgments
The authors would like to thank the anonymous referees for their helpful comments in improving the presentation of this paper. This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 107-2118-M-033-002-MY2.
References (48)
- et al., Fusion-based foreground enhancement for background subtraction using multivariate multi-model Gaussian distribution, Inf. Sci. (2018)
- et al., Model-based cluster analysis, Pattern Recognit. (1993)
- et al., Model-based clustering of high-dimensional data: a review, Comput. Stat. Data Anal. (2014)
- et al., Linear dimensionality reduction for classification via a sequential Bayes error minimisation with an application to flow meter diagnostics, Expert Syst. Appl. (2018)
- et al., Dual-graph regularized non-negative matrix factorization with sparse and orthogonal constraints, Eng. Appl. Artif. Intell. (2018)
- et al., Feature selection based dual-graph sparse non-negative matrix factorization for local discriminative clustering, Neurocomputing (2018)
- A new approach to clustering, Inf. Control (1969)
- et al., Self-representation based dual-graph regularized feature selection clustering, Neurocomputing (2016)
- et al., Global discriminative-based nonnegative spectral clustering, Pattern Recognit. (2016)
- et al., Robust-learning fuzzy c-means clustering algorithm with unknown number of clusters, Pattern Recognit. (2017)