Interpretable interval type-2 fuzzy predicates for data clustering: A new automatic generation method based on self-organizing maps
Introduction
Data clustering refers to grouping data according to a similarity criterion [1], revealing hidden structures in the data. It has applications in many different fields, such as data mining, marketing, machine learning, bioinformatics, image segmentation, and pattern recognition, and new methods are continually proposed [2], [3], [4], [5]. Clustering methods are primarily designed to assign clusters to data and usually require no prior information about the expected results other than the number of clusters to be obtained. As a result, the outcome of the most common methods typically consists of a vector containing the cluster assigned to each datum, together with the centroids or representative data of each cluster (called cluster prototypes).
Although traditional approaches to data clustering are used only to group data, other potential applications can emerge. In fact, clustering can be regarded as a set of data analysis techniques that discover groups of similar data and whose results can be exploited to extract information about those groups. Such information could concern which properties the data within the same cluster have in common and how these properties differ from one cluster to another [3].
In this regard, Fuzzy Logic (FL), conceived as a natural extension of Boolean logic that introduces degrees of truth between 0 (completely false) and 1 (completely true), is able to model linguistic expressions and concepts, including imprecision and vagueness. This makes it well suited to modeling and implementing human reasoning expressed through linguistic expressions, and thus to achieving interpretable clustering.
Typical FL models are based on fuzzy inference systems that use IF-THEN rules, following approaches such as Mamdani and Takagi–Sugeno–Kang [6], [7], [8], [9], [10], [11], [12], and have been applied in image classification, image segmentation, speech recognition, and control, among others. Although widely used, a fuzzy inference system requires defining fuzzification, aggregation and defuzzification operators, and its outcome is a continuous variable. In data clustering applications, these characteristics make it difficult to understand the relation between the data and their properties and the system outcome, i.e. the assigned cluster [2], [3]. On the other hand, models based on fuzzy predicates extend Boolean predicates by modeling degrees of truth with values between 0 and 1 [13]. When applied to data clustering, fuzzy predicate models make it possible to encode knowledge about the clustering, explaining which values of each feature are related to each of the clusters and modeling these relationships using membership functions and predicates. Such models have been successfully applied in data clustering [2], [3], [14], [15], and have the following features:
- Each cluster is explained by a fuzzy predicate interpreted as “The datum belongs to the cluster k”, where k is a cluster; the predicate explains which values (which characteristics) of each feature are related to each of the clusters.
- In order to assign clusters to data, the degrees of truth of the predicates are computed for each datum using the membership functions and the fuzzy operators, and the cluster corresponding to the predicate with the maximum degree of truth is assigned to the datum.
- The resulting degrees of truth quantify the degree to which each datum meets the characteristics required to belong to a cluster (i.e. how well a datum is represented by the cluster prototypes).
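The assignment rule described in the list above can be illustrated with a minimal sketch. It is not the formulation used in [2] or [3]: the Gaussian membership shape, the fixed `width`, the use of the minimum as the conjunction operator, and the prototype values are all assumptions made only for illustration.

```python
import math

def gaussian_membership(x, center, width):
    """Degree of truth in [0, 1] that x is close to center (assumed Gaussian shape)."""
    return math.exp(-0.5 * ((x - center) / width) ** 2)

def predicate_truth(datum, prototype, width=1.0):
    """Truth of 'the datum belongs to cluster k'.

    Assumption: the predicate is a conjunction over features, modeled with min.
    """
    return min(gaussian_membership(x, c, width) for x, c in zip(datum, prototype))

def assign_cluster(datum, prototypes, width=1.0):
    """Return the index of the predicate with maximum truth and all degrees of truth."""
    truths = [predicate_truth(datum, p, width) for p in prototypes]
    best = max(range(len(truths)), key=truths.__getitem__)
    return best, truths

# Hypothetical prototypes for two clusters in a two-feature space.
prototypes = [(0.0, 0.0), (5.0, 5.0)]
label, truths = assign_cluster((4.5, 5.2), prototypes)
```

Here `truths` holds one degree of truth per cluster, so the same values that drive the assignment can also be read as how well the datum matches each prototype.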
In the traditional approach, expert knowledge is required to define both the membership functions and the predicates. Once designed, the fuzzy predicates apply that knowledge to assign clusters to data. In recent works, however, an alternative approach to data clustering has been studied, based on the automatic generation of the membership functions and the fuzzy predicates by analyzing the data to be clustered. This approach has an immediate advantage over the traditional one: it not only performs the data clustering, but also provides knowledge about the clustering obtained, through the interpretation of the generated membership functions and predicates. As a consequence, relevant information about the data can be obtained even when no prior information about the problem addressed is available [3].
In this regard, in previous works [2], [3] we proposed two methods for data clustering through fuzzy predicates in which the membership functions and fuzzy predicates are automatically generated from the data to be clustered. In [2], a method called SOM-based Fuzzy Predicate Clustering (SFPC) is proposed, based on Self-Organizing Maps (SOMs), a family of widely known unsupervised and nonparametric neural networks with remarkable abilities for dealing with noise, outliers, and missing values. In the SFPC, a SOM is automatically configured and trained and, then, Fuzzy C-Means (FCM) [16] clustering is applied to the codebook of the SOM, extracting cluster prototypes. From these prototypes, membership functions and fuzzy predicates are defined, linguistically explaining the clusters. The method includes a variant in which several SOMs are generated from data subsets and the predicates obtained from the different SOMs are combined, which could be applied to distributed clustering. The predicates are used to perform the data clustering, and some analysis of the interpretation of the membership functions and predicates is given.
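The prototype-extraction step (FCM applied to a SOM codebook) can be sketched as follows. This is a minimal, pure-Python Fuzzy C-Means under stated assumptions, not the SFPC implementation: the SOM training is not shown, `codebook` is a hypothetical list of codebook vectors, and the deterministic initialisation, the fuzzifier `m = 2` and the stopping tolerance are illustrative choices.

```python
import math

def fcm(points, n_clusters, m=2.0, n_iter=100, tol=1e-6):
    """Minimal Fuzzy C-Means.

    Returns (prototypes, memberships), where memberships[i][k] is the degree
    to which points[i] belongs to cluster k, as computed in the last iteration.
    """
    n, dim = len(points), len(points[0])
    # Assumption: deterministic initialisation from evenly spaced points.
    step = max(1, n // n_clusters)
    centers = [list(points[min(k * step, n - 1)]) for k in range(n_clusters)]
    exponent = 2.0 / (m - 1.0)
    u = []
    for _ in range(n_iter):
        # Membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2 / (m - 1)).
        u = []
        for p in points:
            d = [max(math.dist(p, c), 1e-12) for c in centers]
            u.append([1.0 / sum((d[k] / d[j]) ** exponent for j in range(n_clusters))
                      for k in range(n_clusters)])
        # Prototype update: mean of the points weighted by u_ik^m.
        new_centers = []
        for k in range(n_clusters):
            w = [u[i][k] ** m for i in range(n)]
            total = sum(w)
            new_centers.append([sum(w[i] * points[i][a] for i in range(n)) / total
                                for a in range(dim)])
        shift = max(math.dist(c, nc) for c, nc in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, u

# Hypothetical SOM codebook: six two-dimensional codebook vectors.
codebook = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
prototypes, memberships = fcm(codebook, n_clusters=2)
```

Running FCM on the codebook rather than on the raw data means the clustering operates on a small set of representative vectors, which is the role the SOM plays in the description above.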
In [3], the method called Type-2 Data-based Fuzzy Predicate Clustering (T2-DFPC) is introduced. Unlike the SFPC, it uses interval type-2 FL, which defines a degree of truth as an interval in [0, 1], called the interval of truth values, instead of a single number between 0 and 1 as in type-1 FL; this provides additional degrees of freedom for data clustering. Interval type-2 FL provides more appropriate models than type-1 FL for dealing with vagueness and imprecision about the data characteristics, and can reduce the effect of noise on cluster assignments [3], [13], [17]. In the T2-DFPC, the cluster prototypes are extracted directly from the data without using SOMs, combining FCM with the Bayesian Information Criterion (BIC) [18], [19], which automatically defines the proper number of clusters in each case. Before the cluster prototype extraction, a random partition is performed on the data, obtaining disjoint subsets. The method is also suitable for distributed clustering. The T2-DFPC includes an analysis of the obtained membership functions and predicates, describing how knowledge can be extracted from them. Additionally, a measure of intervals of truth values is proposed, defining a methodology for comparing intervals that enables cluster assignment when interval type-2 fuzzy predicates are used.
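To make the notion of an interval of truth values concrete, the following sketch evaluates interval type-2 predicates and assigns a cluster by comparing intervals. Several parts are assumptions and should not be read as the method of [3]: the Gaussian membership function with an uncertain width (`sigma_lo`, `sigma_hi`), the min-based conjunction applied to both bounds, and in particular `interval_score`, a hypothetical midpoint-based ranking that merely stands in for the measure of intervals of truth values, whose definition is not given in this text.

```python
import math

def it2_gaussian(x, center, sigma_lo, sigma_hi):
    """Interval of truth values (lower, upper) for 'x is close to center'.

    Assumption: Gaussian shape with uncertain width, sigma_lo <= sigma_hi.
    """
    lower = math.exp(-0.5 * ((x - center) / sigma_lo) ** 2)
    upper = math.exp(-0.5 * ((x - center) / sigma_hi) ** 2)
    return (lower, upper)

def it2_and(intervals):
    """Conjunction of intervals using min on both bounds (assumed operator)."""
    lows, highs = zip(*intervals)
    return (min(lows), min(highs))

def interval_score(interval):
    """Hypothetical ranking of an interval by its midpoint (not the measure of [3])."""
    lo, hi = interval
    return 0.5 * (lo + hi)

def assign_cluster_it2(datum, prototypes, sigma_lo=0.5, sigma_hi=1.5):
    """Assign the cluster whose predicate has the highest-ranked interval."""
    truths = [
        it2_and([it2_gaussian(x, c, sigma_lo, sigma_hi) for x, c in zip(datum, p)])
        for p in prototypes
    ]
    best = max(range(len(truths)), key=lambda k: interval_score(truths[k]))
    return best, truths

# Hypothetical prototypes for two clusters in a two-feature space.
prototypes = [(0.0, 0.0), (5.0, 5.0)]
label, truths = assign_cluster_it2((4.5, 5.2), prototypes)
```

The width of each returned interval reflects how much the truth value depends on the uncertain parameter, which is the extra degree of freedom that a single type-1 value cannot express.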
Based on these two previous methods, the present paper proposes a new clustering method called Type-2 SOM-based Fuzzy Predicate Clustering (T2-SFPC), which automatically generates interval type-2 membership functions and fuzzy predicates, allowing data clustering and knowledge discovery. The proposed method uses SOMs to obtain cluster prototypes, exploiting their advantages for dealing with noise, outliers, and missing values, as in the SFPC, and follows the methodology used in that method for the automatic configuration and training of the SOMs. However, unlike the previous SFPC, in the T2-SFPC method M SOMs are automatically configured and trained from M disjoint subsets defined by a random partition of the data, where M is a method parameter. Once the M SOMs are defined, the clustering approach combining FCM with the BIC is applied, so the number of clusters to be obtained does not need to be known in advance. Once the cluster prototypes are extracted, interval type-2 membership functions and fuzzy predicates are generated in a different way from that proposed in the T2-DFPC [3]. Specifically, the new proposal includes parametrizable interval type-2 membership functions, i.e. the parameters of the membership functions can be optimized provided that a specific goal is defined, for instance by adopting a clustering quality measure. As a result of the proposed method, one fuzzy predicate is defined for each cluster. The cluster assignment is performed using the methodology introduced in [3], by means of the measure of intervals of truth values.
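The first step of this pipeline, the random partition of the data into M disjoint subsets, can be sketched as below. The text does not state how subset sizes are chosen, so the near-equal sizes, the function name `random_partition` and the `seed` parameter are assumptions for illustration; each resulting subset would then be used to train one SOM.

```python
import random

def random_partition(data, m, seed=0):
    """Split data into m disjoint subsets of near-equal size, chosen at random.

    Assumption: subsets are balanced; the source only says they are disjoint.
    """
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    # Taking every m-th shuffled index gives m disjoint, exhaustive subsets.
    return [[data[i] for i in indices[j::m]] for j in range(m)]

# Hypothetical dataset of ten one-feature data, partitioned with M = 3.
data = [(float(i),) for i in range(10)]
subsets = random_partition(data, m=3)
```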
The contribution of the proposed method is a new general methodology for data clustering, which can be applied to most clustering problems. The interval type-2 membership functions merge all the knowledge extracted from the cluster prototypes of the M SOMs. The T2-SFPC method preserves all the characteristics of the previous methods, mainly those related to the ability of SOMs to discover natural data groupings and to the knowledge discovery capabilities studied in the T2-DFPC. As in the T2-DFPC, the linguistic expressions extracted from the predicates can be adapted to match the terminology of the domain experts, without requiring any prior knowledge about the dataset or the clustering problem addressed. Tests performed on widely different datasets show better results for the proposed T2-SFPC than those obtained with the SFPC, the T2-DFPC, and classical clustering methods, meaning that the proposed method is an excellent choice when both data clustering and knowledge extraction from the clustering results are needed.
The rest of this paper is structured as follows. Section 2 presents an analysis of the main existing papers concerning SOMs used for data analysis, as well as FL applications. Important concepts related to SOMs and interval type-2 FL are presented in Section 3, after which the proposed method, T2-SFPC, is explained in detail. In Section 4, the experiments performed to assess the proposed method are described and their results are presented, including an example of the interpretation of the obtained clustering in the case of segmentation of brain magnetic resonance images. Finally, Sections 5 and 6 present the discussion and conclusions, commenting on the results and the limitations of the proposed method as well as future work.
Related works
In this Section, some existing SOM-based clustering approaches and some applications of FL related to the proposed method are described. Given the large number of papers on these topics, the present analysis is intended to cover the most relevant ones. Further reviews can be consulted in [2], [7], [11], [20]. Descriptions of the SFPC and T2-DFPC methods are omitted, as these were given in the previous Section.
As it has been mentioned, in
Methods
In this Section, concepts related to both SOMs and interval type-2 FL in data clustering are reviewed. As both topics are well known, only the most important concepts are presented. Then, the proposed method, called Type-2 SOM-based Fuzzy Predicate Clustering (T2-SFPC), is explained in detail.
Experiments
In this Section, the experiments carried out to assess the T2-SFPC method and the corresponding results are presented and described in detail. At the end, an illustrative example of the interpretation of, and knowledge extraction from, the membership functions and fuzzy predicates generated with the T2-SFPC is given, considering the segmentation of brain magnetic resonance images.
Discussion
As mentioned, a major contribution of this approach is the interpretability of the clusters. Once the clusters have been found, the membership functions can be analyzed to obtain a useful linguistic interpretation of the groups. Specific terminology from the field of the data can be used, making it possible to obtain new knowledge, as analyzed in Section 4.2.
Analyzing the numerical results obtained during the method assessment, performance achieved by the method
Conclusion
In this paper, a new SOM-based method is proposed for the automatic generation of a clustering system based on interval type-2 fuzzy predicates. The method, called Type-2 SOM-based Fuzzy Predicate Clustering (T2-SFPC), builds on two previous methods based on fuzzy predicates: the SOM-based Fuzzy Predicate Clustering (SFPC) and the Type-2 Data-based Fuzzy Predicate Clustering (T2-DFPC). The method proposed exploits all the advantages of the SOMs for dealing with noise, outliers, and missing
Acknowledgement
The authors acknowledge support from the Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina.
References (53)
- et al. Discovering knowledge from data clustering using automatically-defined interval type-2 fuzzy predicates. Expert Syst. Appl. (2017)
- et al. The brain MR Image segmentation techniques and use of diagnostic packages. Acad. Radiol. (2010)
- et al. A review on the applications of type-2 fuzzy logic in classification and pattern recognition. Expert Syst. Appl. (2013)
- A new approach to clustering. Inf. Control. (1969)
- et al. Improving MR brain image segmentation using self-organising maps and entropy-gradient clustering. Inf. Sci. (Ny). (2014)
- Extending the Kohonen self-organizing map networks for clustering analysis. Comput. Stat. Data Anal. (2001)
- The concept of a linguistic variable and its application to approximate reasoning. Inf. Sci. (1975)
- et al. Uncertainty measures for interval type-2 fuzzy sets. Inf. Sci. (Ny). (2007)
- et al. A review of fuzzy set aggregation connectives. Inf. Sci. (Ny). (1985)
- et al. A probabilistic theory of clustering. Pattern Recognit. (2004)
- Model-based evaluation of clustering validation measures. Pattern Recognit.
- Relationship between the accuracy of classifier error estimation and complexity of decision boundary. Pattern Recognit.
- Medical data mining by fuzzy modeling with selected features. Artif. Intell. Med.
- Modeling wine preferences by data mining from physicochemical properties. Decis. Support Syst.
- Data clustering: a review. ACM Comput. Surv.
- Automatic design of interpretable fuzzy predicate systems for clustering using self-organizing maps. Neurocomputing
- Robust spatially constrained fuzzy c-means algorithm for brain MR image segmentation. Pattern Recognit.
- Enhanced fuzzy system models with improved fuzzy clustering algorithm. IEEE Trans. Fuzzy Syst.
- A survey of medical images and signal processing problems solved successfully by the application of Type-2 Fuzzy Logic. J. Phys. Conf. Ser.
- Knowledge-leverage-based fuzzy system and its modeling. IEEE Trans. Fuzzy Syst.
- A self-organizing ts-type fuzzy network with support vector learning and its application to classification problems. IEEE Trans. Fuzzy Syst.
- FRBC: a fuzzy rule-based clustering algorithm. IEEE Trans. Fuzzy Syst.
- Type-2 fuzzy hidden Markov models and their application to speech recognition. IEEE Trans. Fuzzy Syst.
- Type-2 fuzzy logic in decision support systems
- Using SOM as a tool for automated design of clustering systems based on fuzzy predicates
- A framework for tissue discrimination in Magnetic Resonance brain images based on predicates analysis and compensatory fuzzy logic. Int. J. Intell. Comput. Med. Sci. Image Process. IC-MED