Two cluster validity indices for the LAMDA clustering method

https://doi.org/10.1016/j.asoc.2020.106102

Highlights

  • The paper proposes two cluster validity indices for the LAMDA clustering algorithm.

  • The CVGED index is based on the granulation error and the ratio of the distance.

  • The CVCOD index is based on the ratio of the distances and the ratio of compactness.

  • CVGED and CVCOD perform better than ICC and CV.

  • CVCOD can find the best parameter values and increase the quality of the partition.

Abstract

The learning algorithm and multivariable data analysis (LAMDA) is an algorithm for grouping quantitative and qualitative data through self-learning and/or directed learning. Usually, LAMDA automatically generates classes by assigning the best data partition to a class. To evaluate the data partitions generated by LAMDA, internal evaluation is used to find the optimal number of clusters. For the LAMDA algorithm, the most popular index is the cluster validity (CV), which is based on the inter-class contrast (ICC). However, other indices have not been defined for LAMDA, and a comparative analysis is required to evaluate its performance. In this paper, two metrics are proposed: the cluster validity index based on granulation error and the ratio of the distance (CVGED) and the cluster validity index based on the ratios of covariance and distance (CVCOD). These indices are compared with the CV and ICC indices in two experiments: one using a database repository and one using selected open data and experimental laboratory data. According to the main results, CVGED and CVCOD perform better in compactness, separation, and coefficient of variation than ICC and CV for most of the selected repository databases, but the accuracy is limited for all four indices. Nevertheless, CVCOD improves the quality of the data partition when the open data and experimental laboratory data are used.

Section snippets

Mathematical expressions and symbols

See Table 1.

LAMDA

LAMDA is a fuzzy method for clustering and classification tasks. Considering the former, LAMDA calculates the adequacy between samples and a class, using historical data. To find the adequacy, the algorithm relates the contribution of features or attributes of a sample with respect to a class. The above allows establishing the global adequacy between a sample and a class [33]. To understand its operation, four steps are explained as shown below:
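The adequacy computation described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the fuzzy binomial function is used for the marginal adequacy (MAD), and the global adequacy (GAD) mixes a min t-norm and a max t-conorm through an exigency parameter `alpha`. Both are common choices in the LAMDA literature, but they are assumed here and are not necessarily the functions used in the paper. The names `binomial_mad`, `gad`, `assign`, and the example values are hypothetical.

```python
import numpy as np

def binomial_mad(x, rho):
    """Marginal adequacy of each feature of sample x (D,) to each class.

    rho has shape (K, D) and holds one parameter in (0, 1) per class and
    feature. The fuzzy binomial form is an assumption of this sketch.
    """
    return rho ** x * (1.0 - rho) ** (1.0 - x)

def gad(mad, alpha=0.8):
    """Global adequacy per class: linear mix of min (t-norm) and max (t-conorm).

    alpha is the assumed exigency parameter in [0, 1].
    """
    return alpha * mad.min(axis=-1) + (1.0 - alpha) * mad.max(axis=-1)

def assign(x, rho, alpha=0.8):
    """Return the index of the class with the highest GAD and all GAD scores."""
    scores = gad(binomial_mad(x, rho), alpha)
    return int(np.argmax(scores)), scores

# Hypothetical example: row 0 plays the role of a non-informative class
# (all parameters 0.5), rows 1 and 2 are two learned classes.
rho = np.array([[0.5, 0.5],
                [0.9, 0.9],
                [0.1, 0.1]])
x = np.array([0.95, 0.90])  # sample with features normalized to [0, 1]
label, scores = assign(x, rho)
```

With parameters fixed at 0.5, the binomial MAD equals 0.5 for any feature value, so a sample is assigned to a learned class only when that class's GAD exceeds this baseline.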

New cluster validity indices for the LAMDA algorithm

First, the definitions of the two indices are presented; then the importance of their expressions and the interactions between them are explained.

Computational complexity of the LAMDA algorithm and metrics

To analyze the computational complexity of the LAMDA algorithm and the cluster validity indices, two steps are defined: (1) applying the LAMDA algorithm and (2) applying the cluster validity indices.

  1. First step: The LAMDA algorithm calculates the MAD and GAD functions according to the size of the historical data. The MAD function has a computational complexity of O(KND), but this can increase to O(K²ND) if N or D is large. The increase is related to the automatic generation of classes.
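The O(KND) term can be read as one MAD evaluation per class, sample, and feature. The sketch below makes that count explicit. It assumes that K is the number of classes, N the number of samples, and D the number of features, and it again uses the fuzzy binomial MAD as an illustrative choice rather than the paper's own definition. The function name `mad_tensor` and the random data are hypothetical.

```python
import numpy as np

def mad_tensor(X, rho):
    """Binomial MAD of every sample and feature to every class.

    X:   (N, D) samples normalized to [0, 1]
    rho: (K, D) class parameters in (0, 1)
    Returns an array of shape (K, N, D), one entry per MAD evaluation.
    """
    X_b = X[None, :, :]      # broadcast samples over classes
    rho_b = rho[:, None, :]  # broadcast class parameters over samples
    return rho_b ** X_b * (1.0 - rho_b) ** (1.0 - X_b)

K, N, D = 3, 100, 4                       # assumed sizes for illustration
rng = np.random.default_rng(0)
X = rng.random((N, D))                    # synthetic normalized data
rho = rng.uniform(0.1, 0.9, size=(K, D))  # synthetic class parameters

mads = mad_tensor(X, rho)
n_evaluations = mads.size                 # equals K * N * D
```

Doubling any one of K, N, or D doubles `n_evaluations`, which matches the linear dependence on each factor in O(KND).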

Experimental settings

In this section, we explain the two kinds of experiments in which the ICC, CV, CVGED, and CVCOD indices are applied. All experiments use the LAMDA algorithm with the different GAD functions explained in Section 2.1.

Results and discussion

This section presents the main results and a general discussion. To clarify the main results, several points are noted below:

  • The symbols () and () mean max and min value, respectively.

  • These symbols are also used for other metrics.

  • MAD and GAD functions are represented by M() and G(), where () indicates the kind of function.

  • The TIGAD function, the GTD function, and the intuitionistic fuzzy complement function are represented as TIGADGTD()Name, where () indicates the kind

Conclusion

In this paper, two cluster validity indices, CVGED and CVCOD, are proposed for the LAMDA clustering algorithm. The CVCOD index shows the best performance in experiments 1 and 2, where the optimal number of clusters and the best clustering quality were obtained. One advantage of the CVCOD index is that it improves clustering stability (experiment 1) and generates the best data partition as analyzed by other indices and metrics (experiment 2). Therefore, the CVCOD index can find the most optimal

CRediT authorship contribution statement

Javier Fernando Botía Valderrama: Conceptualization, Methodology, Formal analysis, Writing - original draft, Project administration, Validation, Supervision. Diego José Luis Botía Valderrama: Software, Investigation, Writing - review & editing, Validation, Visualization, Resources.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We acknowledge “Alcaldía Municipal de Medellín - Información y Evaluación Estratégica” and the open data supplied by “Ministerio de las TICs”, “Secretaría de Educación de Antioquia”, “Departamento Administrativo de Planeación - Subdirección de Información y Evaluación Estratégica” and “MEData”, for allowing access to and use of the historical data on the population projections 1995–2005–2015 and 2016–2020 of Medellín city, the multidimensional quality of life index survey of Medellín city,

References (67)

  • Wang W. et al. On fuzzy cluster validity indices. Fuzzy Sets and Systems (2007)

  • Botía J.F. et al. Automaton based on fuzzy clustering methods for monitoring industrial processes. Eng. Appl. Artif. Intell. (2013)

  • Zimmermann H.-J. et al. Latent connectives in human decision making. Fuzzy Sets and Systems (1980)

  • Chaira T. A novel intuitionistic fuzzy c means clustering algorithm and its application to medical images. Appl. Soft Comput. (2011)

  • Kosko B. Fuzzy entropy and conditioning. Inform. Sci. (1986)

  • Bandyopadhyay S. et al. Use of a fuzzy granulation – degranulation criterion for assessing cluster validity. Fuzzy Sets and Systems (2011)

  • Zhou K. et al. Exploring the uniform effect of FCM clustering: A data distribution perspective. Knowl.-Based Syst. (2016)

  • Rousseeuw P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. (1987)

  • Suleman A. Measuring the congruence of fuzzy partitions in fuzzy c-means clustering. Appl. Soft Comput. (2017)

  • Kim B. et al. Integrating cluster validity indices based on data envelopment analysis. Appl. Soft Comput. (2018)

  • Cornuéjols A. et al. Collaborative clustering: Why, when, what and how. Inf. Fusion (2018)

  • Ratner B. Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data (2011)

  • Lesot M.-J. et al. Fuzzy prototypes: From a cognitive view to a machine learning principle

  • Gustafson D.E. et al. Fuzzy clustering with a fuzzy covariance matrix

  • Huang H. et al. Multiple kernel fuzzy clustering. IEEE Trans. Fuzzy Syst. (2012)

  • Thong P.H. et al. Picture fuzzy clustering: a new computational intelligence method. Soft Comput. (2016)

  • Martín J.A. et al. Controlling selectivity in nonstandard pattern recognition algorithms, the process of classification and learning the meaning of linguistic descriptors of concepts (1982)

  • Zambrano J.G. et al. Search algorithm for image recognition based on learning algorithm for multivariate data analysis

  • Botía J.F. Methodology for Predicting the Behavior of Optical Frequency Comb (2017)

  • Kempowsky-Hamon T. et al. Fuzzy logic selection as a new reliable tool to identify molecular grade signatures in breast cancer – the INNODIAG study. BMC Med. Genom. (2015)

  • Doncescu A. et al. Reinforced operators in fuzzy clustering systems

  • Emilion R. et al. A general version of the triple Π operator. Int. J. Intell. Syst. (2013)

  • Bedoya C. et al. Yager–Rybalov triple Π operator as a means of reducing the number of generated clusters in unsupervised anuran vocalization recognition
