Two cluster validity indices for the LAMDA clustering method

doi:10.1016/j.asoc.2020.106102

Applied Soft Computing

Volume 89, April 2020, 106102

https://doi.org/10.1016/j.asoc.2020.106102

Highlights

• The paper proposes two cluster validity indices for the LAMDA clustering algorithm.
• The CVGED index is based on the granulation error and the ratio of the distance.
• The CVCOD index is based on the ratio of the distances and the ratio of compactness.
• CVGED and CVCOD perform better than ICC and CV.
• CVCOD can find the best parameter values and increase the quality of the partition.

Abstract

The learning algorithm and multivariable data analysis (LAMDA) is an algorithm for grouping quantitative and qualitative data by applying self-learning and/or directed learning. Usually, LAMDA automatically generates classes by assigning the best data partition to a class. To evaluate the data partitions generated by LAMDA, internal evaluation is used to find the optimal number of clusters. For the LAMDA algorithm, the most popular index is the cluster validity (CV), which is based on the inter-class contrast (ICC). However, other indices have not been defined for LAMDA, and a comparative analysis is required to evaluate its performance. In this paper, two metrics are proposed: the cluster validity index based on granulation error and the ratio of the distance (CVGED) and the cluster validity index based on the ratios of covariance and distance (CVCOD). These indices are compared with the CV and ICC indices in two experiments: one using a database repository and one using selected open data and experimental laboratory data. According to the main results, CVGED and CVCOD perform better in compactness, separation, and coefficient of variation than ICC and CV for most of the selected repository databases, but the accuracy is limited for all four indices. Nevertheless, CVCOD improves the quality of the data partition when the open data and experimental laboratory data are used.

Section snippets

Mathematical expressions and symbols

See Table 1.

LAMDA

LAMDA is a fuzzy method for clustering and classification tasks. For clustering, LAMDA calculates the adequacy between samples and a class using historical data. To find the adequacy, the algorithm relates the contribution of each feature or attribute of a sample to a class. This allows the global adequacy between a sample and a class to be established [33]. To understand its operation, four steps are explained below:
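Independently of the four steps, the adequacy idea described above can be illustrated with a minimal sketch. It assumes choices that are common in the general LAMDA literature but are not confirmed by this excerpt: a fuzzy binomial function for the per-feature (marginal) adequacy and a linear mix of min and max for the global adequacy. The function names, the parameter `alpha` and the numeric values are hypothetical.

```python
def mad_binomial(x, rho):
    """Marginal adequacy of a normalized feature value x in [0, 1] to a class
    whose parameter for that feature is rho in (0, 1).
    Assumption: fuzzy binomial form rho**x * (1 - rho)**(1 - x)."""
    return (rho ** x) * ((1.0 - rho) ** (1.0 - x))


def gad_min_max(mads, alpha=0.8):
    """Global adequacy as a linear mix of min (t-norm) and max (t-conorm).
    alpha in [0, 1] is an assumed mixing parameter, not a value from the paper."""
    return alpha * min(mads) + (1.0 - alpha) * max(mads)


def assign(sample, class_params, alpha=0.8):
    """Return (index of the class with the highest global adequacy, all adequacies)."""
    gads = [
        gad_min_max([mad_binomial(x, r) for x, r in zip(sample, rhos)], alpha)
        for rhos in class_params
    ]
    best = max(range(len(gads)), key=gads.__getitem__)
    return best, gads


# Made-up example: two classes, two normalized features.
class_params = [[0.2, 0.3], [0.8, 0.7]]
best, gads = assign([0.9, 0.8], class_params)
print(best, gads)  # best is 1: the sample is closer to the second class
```

Each feature contributes one marginal value, and the global value summarizes them per class, which is the sense in which the contribution of every attribute determines the adequacy of a sample to a class.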

New cluster validity indices for the LAMDA algorithm

First, the definitions of the two indices are presented; afterwards, the importance of their expressions and the interactions between them are explained.
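The definitions of CVGED and CVCOD are not reproduced in this excerpt, so the following is only a generic illustration of two ingredients named in the highlights: compactness and a ratio of distances. It is not the authors' formulation. The helper names are hypothetical, Euclidean distance is an assumption, and at least two clusters are required.

```python
from math import dist  # Python 3.8+


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def compactness(clusters):
    """Mean distance of every sample to its own cluster centroid (lower is tighter)."""
    total, count = 0.0, 0
    for pts in clusters:
        c = centroid(pts)
        total += sum(dist(p, c) for p in pts)
        count += len(pts)
    return total / count


def separation(clusters):
    """Smallest distance between any two cluster centroids (higher is better)."""
    cents = [centroid(pts) for pts in clusters]
    return min(dist(a, b) for i, a in enumerate(cents) for b in cents[i + 1:])


def distance_ratio(clusters):
    """Generic compactness-to-separation ratio; smaller suggests a better partition."""
    return compactness(clusters) / separation(clusters)


# Two made-up partitions of the same four points.
good = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
bad = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]
print(distance_ratio(good), distance_ratio(bad))
```

An internal index of this kind can be evaluated for each candidate partition, and the partition with the best value is retained, which is how internal evaluation is typically used to choose the number of clusters.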

Computational complexity of the LAMDA algorithm and metrics

To analyze the computational complexity of the LAMDA algorithm and the cluster validity indices, two steps are defined: (1) applying the LAMDA algorithm and (2) applying the cluster validity indices.

1. First step: The LAMDA algorithm calculates the MAD and GAD functions according to the size of the historical data. The MAD function has a computational complexity of $O(KND)$, but this can increase to $O(K^{2}ND)$ if $N$ or $D$ takes a large value. This increase is related to the automatic generation of classes.
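To make the $O(KND)$ term concrete, the hypothetical helper below (not from the paper) counts one marginal-adequacy evaluation per class, sample and feature in a single pass, with $K$ classes, $N$ samples and $D$ features. It does not model the automatic generation of classes, so the $O(K^{2}ND)$ case is outside its scope.

```python
def count_mad_evaluations(samples, class_params):
    """Count marginal-adequacy evaluations for one pass over the data:
    one evaluation per (class, sample, feature) triple."""
    ops = 0
    for rhos in class_params:        # K classes
        for x in samples:            # N samples
            for _ in zip(x, rhos):   # D features
                ops += 1
    return ops


# Made-up sizes: N = 4 samples, D = 3 features, K = 2 classes.
samples = [[0.1, 0.2, 0.3]] * 4
class_params = [[0.5, 0.5, 0.5]] * 2
print(count_mad_evaluations(samples, class_params))  # 2 * 4 * 3 = 24
```

Doubling any one of $K$, $N$ or $D$ doubles the count, which is the linear behaviour that the $O(KND)$ bound expresses.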

Experimental settings

In this section, we explain the two kinds of experiments used to apply the $ICC$, $CV$, $CVGED$ and $CVCOD$ indices. Note that all experiments use the LAMDA algorithm with the different GAD functions explained in Section 2.1.

Results and discussion

This section presents the main results and a general discussion. To clarify the main results, several points are noted below:

• The symbols ( $↑$ ) and ( $↓$ ) mean the $\max$ and $\min$ value, respectively.
• These symbols are also used for other metrics.
• MAD and GAD functions are represented by $M^{(\cdot)}$ and $G^{(\cdot)}$, where $(\cdot)$ indicates the kind of function.
• The TIGAD function, the GTD function, and the intuitionistic fuzzy complement function are represented as $TIGAD\text{-}GTD^{(\cdot)}\text{-}Name$, where $(\cdot)$ indicates the kind

Conclusion

In this paper, two cluster validity indices, CVGED and CVCOD, are proposed for the LAMDA clustering algorithm. The CVCOD index shows the best performance in experiments 1 and 2, where the optimal number of clusters and quality of clustering were obtained. One advantage of the CVCOD index is that it improves clustering stability (experiment 1) and generates the best data partition as analyzed by other indices and metrics (experiment 2). Therefore, the CVCOD index can find the most optimal

CRediT authorship contribution statement

Javier Fernando Botía Valderrama: Conceptualization, Methodology, Formal analysis, Writing - original draft, Project administration, Validation, Supervision. Diego José Luis Botía Valderrama: Software, Investigation, Writing - review & editing, Validation, Visualization, Resources.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We acknowledge “Alcaldía Municipal de Medellín - Información y Evaluación Estratégica” and the open data supplied by “Ministerio de las TICs”, “Secretaría de Educación de Antioquia”, “Departamento Administrativo de Planeación - Subdirección de Información y Evaluación Estratégica” and “MEData”, for allowing access to and use of the historical data on the population projections 1995–2005–2015 and 2016–2020 of Medellín city, multidimensional index quality of life survey of Medellín city,

References (67)

Jain A.K. Data clustering: 50 years beyond k-means. Pattern Recognit. Lett. (2010)
Bezdek J.C. et al. FCM: The fuzzy C-means clustering algorithm. Comput. Geosci. (1984)
Kempowsky T. et al. Process situation assessment: From a fuzzy partition to a finite state machine. Eng. Appl. Artif. Intell. (2006)
Lamrini B. et al. Detection of functional states by the ‘LAMDA’ classification technique: application to a coagulation process in drinking water treatment. C. R. Phys. (2005)
Isaza C.V. et al. Situation prediction based on fuzzy clustering for industrial complex processes. Inform. Sci. (2014)
Ruiz F.A. et al. A new criterion to validate and improve the classification process of LAMDA algorithm applied to diesel engines. Eng. Appl. Artif. Intell. (2017)
Doncescu A. et al. Image color segmentation using the fuzzy tree algorithm T-LAMDA. Fuzzy Sets and Systems (2007)
Botía J.F. et al. Fuzzy cellular automata and intuitionistic fuzzy sets applied to an optical frequency comb spectral shape. Eng. Appl. Artif. Intell. (2017)
Hedjazi L. et al. Similarity-margin based feature selection for symbolic interval data. Pattern Recognit. Lett. (2011)
Hedjazi L. et al. Membership-margin based feature selection for mixed type and high-dimensional data: Theory and applications. Inform. Sci. (2015)
Wang W. et al. On fuzzy cluster validity indices. Fuzzy Sets and Systems (2007)
Botía J.F. et al. Automaton based on fuzzy clustering methods for monitoring industrial processes. Eng. Appl. Artif. Intell. (2013)
Zimmermann H.-J. et al. Latent connectives in human decision making. Fuzzy Sets and Systems (1980)
Chaira T. A novel intuitionistic fuzzy c means clustering algorithm and its application to medical images. Appl. Soft Comput. (2011)
Kosko B. Fuzzy entropy and conditioning. Inform. Sci. (1986)
Bandyopadhyay S. et al. Use of a fuzzy granulation – degranulation criterion for assessing cluster validity. Fuzzy Sets and Systems (2011)
Zhou K. et al. Exploring the uniform effect of FCM clustering: A data distribution perspective. Knowl.-Based Syst. (2016)
Rousseeuw P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. (1987)
Suleman A. Measuring the congruence of fuzzy partitions in fuzzy c-means clustering. Appl. Soft Comput. (2017)
Kim B. et al. Integrating cluster validity indices based on data envelopment analysis. Appl. Soft Comput. (2018)
Cornuéjols A. et al. Collaborative clustering: Why, when, what and how. Inf. Fusion (2018)
Ratner B. Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data (2011)
Lesot M.-J. et al. Fuzzy prototypes: From a cognitive view to a machine learning principle
Gustafson D.E. et al. Fuzzy clustering with a fuzzy covariance matrix
Huang H. et al. Multiple kernel fuzzy clustering. IEEE Trans. Fuzzy Syst. (2012)
Thong P.H. et al. Picture fuzzy clustering: a new computational intelligence method. Soft Comput. (2016)
Martín J.A. et al. Controlling selectivity in nonstandard pattern recognition algorithms, the process of classification and learning the meaning of linguistic descriptors of concepts (1982)
Zambrano J.G. et al. Search algorithm for image recognition based on learning algorithm for multivariate data analysis
Botía J.F. Methodology for Predicting the Behavior of Optical Frequency Comb (2017)
Kempowsky-Hamon T. et al. Fuzzy logic selection as a new reliable tool to identify molecular grade signatures in breast cancer – the INNODIAG study. BMC Med. Genom. (2015)
Doncescu A. et al. Reinforced operators in fuzzy clustering systems
Emilion R. et al. A general version of the triple $Π$ operator. Int. J. Intell. Syst. (2013)
Bedoya C. et al. Yager–Rybalov triple $Π$ operator as a means of reducing the number of generated clusters in unsupervised anuran vocalization recognition

