Elsevier

Knowledge-Based Systems

Volume 133, 1 October 2017, Pages 234-254
Interpretable interval type-2 fuzzy predicates for data clustering: A new automatic generation method based on self-organizing maps

https://doi.org/10.1016/j.knosys.2017.07.012

Highlights

  • A new clustering method based on interval type-2 fuzzy predicates and SOMs is proposed.

  • SOMs are automatically configured and trained.

  • Fuzzy predicates are generated using cluster prototypes extracted from SOMs.

  • Linguistic knowledge is obtained from the predicates automatically generated.

  • The proposed method outperforms existing clustering methods based on fuzzy predicates.

Abstract

In previous works, we proposed two methods for data clustering based on automatically discovered fuzzy predicates, referred to as SOM-based Fuzzy Predicate Clustering (SFPC) [Meschino et al., Neurocomputing, 147, 47–59 (2015)] and Type-2 Data-based Fuzzy Predicate Clustering (T2-DFPC) [Comas et al., Expert Syst. Appl., 68, 136–150 (2017)]. In such methods, fuzzy predicates allow both data clustering and knowledge discovery about the obtained clusters. This last feature is novel compared to other existing approaches and is a major contribution to the data clustering field. Based on these previous methods, the present paper proposes a new automatic clustering method based on fuzzy predicates, which uses Self-Organizing Maps (SOMs) and is called Type-2 SOM-based Fuzzy Predicate Clustering (T2-SFPC). The new method does not require any prior knowledge about the clustering problem addressed. First, a random partition is defined on the dataset to be clustered, and SOMs are configured and trained using the resulting data subsets. Second, an automatic clustering approach is applied to the SOM codebooks, discovering representative data of the different clusters, which are called cluster prototypes. Third, interval type-2 membership functions formed by Gaussian-shaped sub-functions and fuzzy predicates are defined, allowing data clustering and its interpretation. The proposed method preserves all the advantages of the previous methods SFPC and T2-DFPC regarding knowledge extraction capabilities and potential application to distributed clustering and parallel computing. Moreover, results obtained on several public datasets showed more compact and better separated clusters for the T2-SFPC, which outperformed both the previous methods and the several classical clustering approaches tested, considering internal and external validation indices. Additionally, both clustering interpretation and optimization capabilities are improved by the proposed method when compared to the methods SFPC and T2-DFPC.

Introduction

Data clustering refers to grouping data according to a similarity criterion [1], revealing hidden structures in data. It has applications in very different fields, such as data mining, marketing, machine learning, bioinformatics, image segmentation, and pattern recognition, among others, and new methods are continually proposed [2], [3], [4], [5]. Clustering methods are primarily designed to assign clusters to data, usually requiring no prior information about the expected results except for the number of clusters to be obtained. As a result, the outcome of the most common methods typically consists of a vector containing the corresponding cluster for each datum, together with the cluster centroids, i.e. representative data for each of the clusters (called cluster prototypes).

Although traditional data clustering approaches are used only to group data, other potential applications can emerge. In fact, clustering can be regarded as a set of data analysis techniques that discover groups of similar data, and their results can be exploited to extract information about those groups. Such information could concern the common properties of the data within the same cluster and how these properties differ from one cluster to another [3].

In this regard, Fuzzy Logic (FL), conceived as a natural extension of Boolean logic that introduces degrees of truth between 0 (completely false) and 1 (completely true), is able to model linguistic expressions and concepts, including imprecision and vagueness. This makes it well suited to modeling and implementing human reasoning expressed through linguistic expressions, and thus to achieving interpretable clustering.

Typical FL models are based on fuzzy inference systems using IF-THEN rules, following approaches such as Mamdani and Takagi–Sugeno–Kang [6], [7], [8], [9], [10], [11], [12], and are applied in image classification, image segmentation, speech recognition, and control, among others. Although widely used, a fuzzy inference system requires defining fuzzification, aggregation and defuzzification operators, and its outcome is a continuous variable. In data clustering applications, these characteristics make it difficult to understand how the data and their properties relate to the system outcome, i.e. the assigned cluster [2], [3]. In contrast, models based on fuzzy predicates extend Boolean predicates, modeling degrees of truth of predicates with values between 0 and 1 [13]. When applied to data clustering, fuzzy predicate models make it possible to represent knowledge about the clustering, explaining which values of each feature are related to each of the clusters and modeling these relationships using membership functions and predicates. Such models have been successfully applied in data clustering [2], [3], [14], [15], and have the following features:

  • Each cluster is explained by a fuzzy predicate interpreted as “The datum belongs to the cluster k”, where k is a cluster, explaining which values (which characteristics) of each feature are related to each of the clusters.

  • In order to assign clusters to data, degrees of truth of the predicates are computed for each datum using the membership functions and the fuzzy operators and in each case the cluster corresponding to the predicate with the maximum degree of truth is assigned to the datum.

  • The resulting degrees of truth quantify the degree to which each datum meets the characteristics required to belong to a cluster (i.e. how well a datum is represented by the cluster prototypes).
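
The assignment rule described in the list above can be illustrated with a minimal sketch. It is not the implementation used in [2] or [3]: the Gaussian membership function, the use of the minimum as conjunction operator, and the prototype and width values are all assumptions made for illustration only.

```python
import math

def gaussian_membership(x, center, sigma):
    """Degree of truth in [0, 1] that feature value x is close to a prototype value."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)

def predicate_truth(datum, prototype, sigmas):
    """Truth of 'the datum belongs to cluster k' as a conjunction over features.

    ASSUMPTION: the conjunction is modeled with the minimum operator.
    """
    return min(
        gaussian_membership(x, c, s)
        for x, c, s in zip(datum, prototype, sigmas)
    )

def assign_cluster(datum, prototypes, sigmas):
    """Return the index of the predicate with maximum truth and all degrees of truth."""
    truths = [predicate_truth(datum, p, sigmas) for p in prototypes]
    best = max(range(len(truths)), key=truths.__getitem__)
    return best, truths

prototypes = [(0.0, 0.0), (5.0, 5.0)]  # hypothetical cluster prototypes
sigmas = (1.0, 1.0)                    # hypothetical per-feature widths
cluster, truths = assign_cluster((4.5, 5.2), prototypes, sigmas)
```

Here `truths` holds one degree of truth per cluster, so it can also be inspected to see how well the datum matches each prototype, not only which cluster wins.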

In the traditional approach, expert knowledge is required to define both the membership functions and the predicates. Once designed, the fuzzy predicates apply that knowledge to assign clusters to data. Nevertheless, recent works have studied an alternative approach to data clustering, based on automatically generating the membership functions and the fuzzy predicates by analyzing the data to be clustered. This approach has an immediate advantage over the traditional one: it not only allows data clustering, but also provides knowledge about the clustering obtained, through interpretation of the generated membership functions and predicates. As a consequence, relevant information about the data can be obtained even when no prior information about the problem addressed is available [3].

In this regard, in previous works [2], [3] we proposed two methods for data clustering through fuzzy predicates in which membership functions and fuzzy predicates are automatically generated from the data to be clustered. In [2], a method called SOM-based Fuzzy Predicate Clustering (SFPC) is proposed, based on Self-Organizing Maps (SOMs), a family of widely known unsupervised and nonparametric neural networks with remarkable abilities for dealing with noise, outliers, and missing values. In the SFPC, a SOM is automatically configured and trained and, then, Fuzzy C-Means (FCM) [16] clustering is applied to the codebook of the SOM, extracting cluster prototypes. From these prototypes, membership functions and fuzzy predicates are defined, linguistically explaining the clusters. The method includes a variant in which several SOMs are generated from data subsets and the predicates obtained from the different SOMs are combined, which could be applied to distributed clustering. The predicates are used to perform the data clustering, and some analysis of the interpretation of the membership functions and predicates is given.
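
As an illustration of the prototype extraction step, the following is a minimal sketch of standard Fuzzy C-Means applied to a small codebook. It assumes the SOM has already been trained and that its codebook is available as a list of weight vectors; the codebook values, the function name `fuzzy_c_means`, and the parameter choices are hypothetical and are not taken from [2].

```python
import math
import random

def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, seed=0):
    """Plain FCM: returns the cluster centers and the membership rows
    computed in the last iteration (one row per point)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, n_clusters)]
    dim = len(points[0])
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(n_iter):
        # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2 / (m - 1))
        memberships = []
        for p in points:
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            memberships.append(
                [1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
                 for d_i in dists]
            )
        # Center update: mean of the points weighted by u_ik^m
        for i in range(n_clusters):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            centers[i] = [
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ]
    return centers, memberships

# Hypothetical SOM codebook: one weight vector per map unit.
codebook = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
            (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
prototype_list, memberships = fuzzy_c_means(codebook, n_clusters=2)
```

Running FCM on the codebook rather than on the raw data means the clustering operates on a much smaller set of vectors, namely the map units.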

In [3], the method called Type-2 Data-based Fuzzy Predicate Clustering (T2-DFPC) is introduced. Unlike the SFPC, it uses interval type-2 FL, which defines a degree of truth as an interval in [0, 1], called an interval of truth values, instead of a single number between 0 and 1 as in type-1 FL; this adds degrees of freedom for data clustering. Interval type-2 FL provides more appropriate models than type-1 FL for dealing with vagueness and imprecision about the data characteristics and can reduce the effect of noise on cluster assignments [3], [13], [17]. In the T2-DFPC, the cluster prototypes are extracted directly from the data without using SOMs, combining FCM with the Bayesian Information Criterion (BIC) [18], [19], which automatically defines the proper number of clusters in each case. Before the cluster prototype extraction, a random partition is performed on the data, obtaining disjoint subsets. The method is also suitable for distributed clustering. The T2-DFPC includes an analysis of the obtained membership functions and predicates, describing how the knowledge can be extracted. Additionally, a measure of intervals of truth values is proposed, defining a methodology for comparing intervals which allows cluster assignment when interval type-2 fuzzy predicates are used.
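
To make the notion of an interval of truth values concrete, the sketch below builds an interval type-2 Gaussian membership function from two widths and assigns a cluster by comparing intervals. The measure of intervals of truth values defined in [3] is not reproduced in this text, so the interval midpoint is used here purely as a stand-in assumption; the minimum-based conjunction and all numeric values are likewise illustrative.

```python
import math

def it2_gaussian(x, center, sigma_lower, sigma_upper):
    """Return the interval of truth values (lower, upper) for feature value x.

    Both bounds are Gaussians with the same center; the narrower width
    (sigma_lower) gives the lower bound and the wider one the upper bound,
    so sigma_lower <= sigma_upper is required.
    """
    d2 = (x - center) ** 2
    lower = math.exp(-0.5 * d2 / sigma_lower ** 2)
    upper = math.exp(-0.5 * d2 / sigma_upper ** 2)
    return lower, upper

def it2_conjunction(intervals):
    """ASSUMPTION: conjunction taken as the minimum on each bound."""
    lowers, uppers = zip(*intervals)
    return min(lowers), min(uppers)

def interval_midpoint(interval):
    """ASSUMPTION: stand-in scalar measure, not the measure defined in [3]."""
    lower, upper = interval
    return 0.5 * (lower + upper)

def assign_cluster_it2(datum, prototypes, sigma_lower, sigma_upper):
    """Pick the cluster whose predicate interval has the largest midpoint."""
    intervals = [
        it2_conjunction([
            it2_gaussian(x, c, sigma_lower, sigma_upper)
            for x, c in zip(datum, proto)
        ])
        for proto in prototypes
    ]
    scores = [interval_midpoint(iv) for iv in intervals]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, intervals

prototypes = [(0.0, 0.0), (5.0, 5.0)]  # hypothetical cluster prototypes
best, intervals = assign_cluster_it2((4.5, 5.2), prototypes, 0.8, 1.2)
```

The width of each returned interval reflects how uncertain the degree of truth is: it collapses to a single value at the prototype itself and widens as the datum moves away from it.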

Based on these two previous methods, the present paper proposes a new clustering method called Type-2 SOM-based Fuzzy Predicate Clustering (T2-SFPC), which automatically generates interval type-2 membership functions and fuzzy predicates, allowing data clustering and knowledge discovery. The proposed method uses SOMs to obtain cluster prototypes, exploiting their advantages in dealing with noise, outliers, and missing values, as in the SFPC, and follows the methodology used in that method for the automatic configuration and training of the SOMs. However, unlike the SFPC, in the T2-SFPC M SOMs are automatically configured and trained from M disjoint subsets defined by a random partition of the data, where M ∈ ℕ is a method parameter. Once the M SOMs are defined, the clustering approach combining FCM with the BIC is applied, so the number of clusters to be obtained need not be known in advance. Once cluster prototypes are extracted, interval type-2 membership functions and fuzzy predicates are generated in a different way from that proposed in the T2-DFPC [3]. Specifically, the new proposal includes parametrizable interval type-2 membership functions, i.e. the parameters of the membership functions can be optimized provided that a specific goal is defined, for instance by adopting a clustering quality measure. As a result of the proposed method, one fuzzy predicate is defined for each cluster. The cluster assignment is performed using the methodology introduced in [3], by means of the measure of intervals of truth values.
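
The random partition of the data into M disjoint subsets can be sketched as follows. This is only one possible way to build such a partition; the function name `random_partition`, the near-equal subset sizes, and the seed handling are assumptions, since the text does not specify how the partition is drawn.

```python
import random

def random_partition(n_data, m_subsets, seed=None):
    """Split the indices 0..n_data-1 into m_subsets disjoint random subsets.

    ASSUMPTION: subsets are made as equal in size as possible.
    """
    rng = random.Random(seed)
    indices = list(range(n_data))
    rng.shuffle(indices)
    # Taking every m-th shuffled index yields disjoint subsets covering all data.
    return [indices[i::m_subsets] for i in range(m_subsets)]

subsets = random_partition(n_data=10, m_subsets=3, seed=1)
```

Each subset of indices would then select the data used to configure and train one of the M SOMs, which is what makes the scheme suitable for distributed or parallel processing.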

The contribution of the proposed method is a new general methodology for data clustering, which can be applied to most clustering problems. The interval type-2 membership functions merge all the knowledge extracted from the cluster prototypes of the M SOMs. The T2-SFPC preserves all the characteristics of the previous methods, mainly those related to the ability of SOMs to discover natural data groupings and to the knowledge discovery capabilities studied in the T2-DFPC. As in the T2-DFPC, linguistic expressions extracted from the predicates can be adapted to match the terminology of the domain experts, without requiring any prior knowledge about the dataset or the clustering problem addressed. Tests performed on widely different datasets reveal better results for the proposed T2-SFPC than those obtained from the SFPC, the T2-DFPC and classical clustering methods, meaning that the proposed method is an excellent choice when both data clustering and knowledge extraction from the clustering results are needed.

The rest of this paper is structured as follows. Section 2 presents an analysis of the main existing papers concerning SOMs used for data analysis as well as FL applications. Important concepts related to SOMs and interval type-2 FL are presented in Section 3, after which the proposed method, T2-SFPC, is explained in detail. In Section 4, the experiments performed to assess the proposed method are described and their results are presented, including an example of the interpretation of the obtained clustering in the case of segmentation of brain magnetic resonance images. Finally, Sections 5 and 6 present the discussion and conclusions, commenting on the results and the limitations of the proposed method as well as future work.

Section snippets

Related works

In the present Section, some existing clustering approaches based on SOMs and some applications of FL related to the proposed method are described. Given the large number of papers related to these issues, the present analysis is intended to cover the most relevant papers on the topic. Further reviews can be consulted in [2], [7], [11], [20]. Descriptions of the methods SFPC and T2-DFPC are omitted, as these were given in the previous Section.

As it has been mentioned, in

Methods

In this Section, concepts related to both SOMs and interval type-2 FL in data clustering are reviewed. As both topics are well known, only the most important concepts are presented. Then, the proposed method, called Type-2 SOM-based Fuzzy Predicate Clustering (T2-SFPC), is explained in detail.

Experiments

In this Section, experiments done in order to assess the method T2-SFPC and the corresponding results are presented and described in detail. At the end, an illustrative example of the interpretation and knowledge extraction from the membership functions and the fuzzy predicates generated with the T2-SFPC is given, considering the segmentation of brain magnetic resonance images.

Discussion

As mentioned, a major contribution of this approach is the interpretability of the clusters. Once the clusters have been found, the membership functions can be analyzed to obtain a useful linguistic interpretation of the groups. Specific terminology from the field of the data can be used, making it possible to obtain new knowledge, as analyzed in Section 4.2.

Analyzing the numerical results obtained during the method assessment, performance achieved by the method

Conclusion

This paper proposes a new SOM-based method for the automatic generation of a clustering system based on interval type-2 fuzzy predicates, called Type-2 SOM-based Fuzzy Predicate Clustering (T2-SFPC), which builds on two previous methods based on fuzzy predicates: the SOM-based Fuzzy Predicate Clustering (SFPC) and the Type-2 Data-based Fuzzy Predicate Clustering (T2-DFPC). The method proposed exploits all the advantages of the SOMs for dealing with noise, outliers, and missing

Acknowledgement

The authors acknowledge support from Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina.

References (53)

  • M. Brun et al. Model-based evaluation of clustering validation measures. Pattern Recognit. (2007)

  • E. Atashpaz-Gargari et al. Relationship between the accuracy of classifier error estimation and complexity of decision boundary. Pattern Recognit. (2013)

  • S.N. Ghazavi et al. Medical data mining by fuzzy modeling with selected features. Artif. Intell. Med. (2008)

  • P. Cortez et al. Modeling wine preferences by data mining from physicochemical properties. Decis. Support Syst. (2009)

  • A.K. Jain et al. Data clustering: a review. ACM Comput. Surv. (1999)

  • G.J. Meschino et al. Automatic design of interpretable fuzzy predicate systems for clustering using self-organizing maps. Neurocomputing (2015)

  • J. Zexuan et al. Robust spatially constrained fuzzy c-means algorithm for brain MR image segmentation. Pattern Recognit. (2014)

  • A. Celikyilmaz et al. Enhanced fuzzy system models with improved fuzzy clustering algorithm. IEEE Trans. Fuzzy Syst. (2008)

  • D.S. Comas et al. A survey of medical images and signal processing problems solved successfully by the application of Type-2 Fuzzy Logic. J. Phys. Conf. Ser. (2011)

  • Z. Deng et al. Knowledge-leverage-based fuzzy system and its modeling. IEEE Trans. Fuzzy Syst. (2013)

  • C.-F. Juang et al. A self-organizing ts-type fuzzy network with support vector learning and its application to classification problems. IEEE Trans. Fuzzy Syst. (2007)

  • E.G. Mansoori. FRBC: a fuzzy rule-based clustering algorithm. IEEE Trans. Fuzzy Syst. (2011)

  • J. Zeng et al. Type-2 fuzzy hidden Markov models and their application to speech recognition. IEEE Trans. Fuzzy Syst. (2006)

  • D.S. Comas et al. Type-2 fuzzy logic in decision support systems

  • G.J. Meschino et al. Using SOM as a tool for automated design of clustering systems based on fuzzy predicates

  • G.J. Meschino et al. A framework for tissue discrimination in Magnetic Resonance brain images based on predicates analysis and compensatory fuzzy logic. Int. J. Intell. Comput. Med. Sci. Image Process. IC-MED (2008)