Semi-supervised Selective Clustering Ensemble based on constraint information

doi:10.1016/j.neucom.2021.07.056

Neurocomputing

Volume 462, 28 October 2021, Pages 412-425

Abstract

Clustering is an important research direction in data mining. However, no single clustering algorithm can be applied efficiently in all situations. Clustering ensemble is an effective way to address this problem: it combines the results of multiple clustering algorithms, and the final result is significantly better than that of a single clustering algorithm. Although a lot of constraint information is available, existing clustering ensemble algorithms do not utilize it. This paper uses constraint information in the consensus function and proposes two algorithms, Semi-supervised Selective Clustering Ensemble based on Chameleon (SSCEC) and Semi-supervised Selective Clustering Ensemble based on Ncut (SSCEN), to solve this problem. SSCEC uses the Chameleon algorithm as its consensus function and processes constraint information during subgraph partitioning and subgraph combining. SSCEN uses the Normalized cut algorithm as its consensus function and processes constraint information during graph bipartition. The experimental results show that the two proposed semi-supervised selective clustering ensemble algorithms perform better than other semi-supervised algorithms.

Introduction

Clustering analysis plays an important role in many fields, such as pattern recognition, image processing, electronic commerce, document clustering, data analysis and customer recommendation, and has become a common method of data mining. Clustering analysis divides data into clusters according to a certain standard. The goal is that the similarity of data in the same cluster is high, and the similarity of data in different clusters is low [1], [2]. Unlike classification, clustering is an unsupervised learning task, because the input data does not contain its true class information. Therefore, clustering uses a similarity measure, such as Euclidean distance, to divide the data into different clusters according to the attributes of the data or its intrinsic features, with the goal of minimizing the distance between data in the same cluster and maximizing the distance between data in different clusters [3].
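The distance-based goal described above can be illustrated with a small sketch. This is not taken from the paper; the function names and the toy points are assumptions chosen for illustration. It compares the mean Euclidean distance between pairs in the same cluster with the mean distance between pairs in different clusters, for one sensible and one poor partition of the same points.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_pair_distances(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = euclidean(points[i], points[j])
        # Same label: the pair lies within one cluster; otherwise across clusters.
        (intra if labels[i] == labels[j] else inter).append(d)

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(intra), mean(inter)

# Hypothetical toy data: two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good_intra, good_inter = mean_pair_distances(points, [0, 0, 1, 1])
bad_intra, bad_inter = mean_pair_distances(points, [0, 1, 0, 1])
```

For the partition that follows the two groups, the intra-cluster mean is small relative to the inter-cluster mean; for the partition that mixes the groups, the relationship is reversed.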

Clustering has been applied in many areas, and with its development, more research on clustering algorithms has emerged. Faced with a growing number of clustering algorithms and data sets, it is difficult to find a clustering algorithm that can be applied to all data sets [4]. Therefore, Strehl and Ghosh proposed the concept of clustering ensemble [5]. Assume that data set $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ contains n data points, and that M clustering algorithms are applied to X to obtain M partitions of X. The set of clustering members $P = \{P_{1}, P_{2}, \dots, P_{M}\}$ is composed of these partitions, where $P_{i}$ denotes the partition obtained with the i-th clustering algorithm. Then, the consensus function $\Gamma$ merges these clustering members to produce the final partition $P^{*}$. The process is shown in Fig. 1.
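The pipeline from clustering members to a final partition can be sketched as follows. This is a minimal illustration under stated assumptions, not the consensus function used in the paper: the members are hard-coded label lists rather than the output of real clustering algorithms, and the consensus step is a simple co-association rule (link two points if more than half of the members place them together, then take connected components). The names `co_association` and `consensus` are hypothetical.

```python
def co_association(partitions):
    """Fraction of partitions that place each pair of points in the same cluster."""
    n = len(partitions[0])
    m = len(partitions)
    sim = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    sim[i][j] += 1.0 / m
    return sim

def consensus(partitions, threshold=0.5):
    """Toy consensus function (an assumption, not Chameleon or Ncut):
    connect pairs whose co-association exceeds the threshold and
    return the connected components as the final partition."""
    sim = co_association(partitions)
    n = len(sim)
    final = [-1] * n
    cluster = 0
    for start in range(n):
        if final[start] != -1:
            continue
        final[start] = cluster
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if final[j] == -1 and sim[i][j] > threshold:
                    final[j] = cluster
                    stack.append(j)
        cluster += 1
    return final

# Three hypothetical members P_1..P_3 over six points; P_2 disagrees on point 2.
members = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
]
final_partition = consensus(members)
```

Here the majority of members outvotes the single dissenting partition, which is the basic intuition behind combining several clustering results.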

Clustering ensemble can make the best use of the partitioning information of all the single clustering algorithms, so that the final result is better than the result of a single clustering algorithm [6]. Each clustering algorithm has its own advantages and disadvantages, which make it applicable only to specific data sets. Clustering ensemble combines the results of these algorithms to avoid the disadvantages of a single clustering algorithm, which makes it suitable for more data sets. In addition, clustering ensemble is robust to noise and outliers [7]. Because of these advantages, clustering ensemble has been applied in medical science, image processing, and synoptic science.

The existing methods of clustering ensemble that use constraint information fall mainly into three types. The first uses constraint information in member generation, so that each generated member conforms to the given constraint information. The second uses constraint information in member selection, selecting members according to the constraint information that each member satisfies. The third uses constraint information in the consensus function. When constraint information is used in the generation mechanism or the member selection stage, the diversity of members is reduced, which affects the result of clustering fusion. In this paper, the members in the selected subset are transformed into a similarity matrix, which can then be improved according to the constraint information. Moreover, because the consensus functions selected in this paper are based on graph algorithms, it is more intuitive and convenient to use the constraint information given by prior knowledge in the consensus function stage. Having compared the advantages and disadvantages of using constraint information in different stages, this paper uses constraint information in the consensus function. Two consensus functions, based on Chameleon and Ncut, first modify the similarity matrix that describes the relationships between data points according to the constraint information, and then use Chameleon or Ncut to exploit the constraint information again in the consensus function itself. The SSCEC and SSCEN algorithms thus make use of the constraint information twice, which greatly improves their effectiveness. Combining the characteristics of the Chameleon algorithm and the normalized cut algorithm, this paper proposes two semi-supervised selective clustering ensemble algorithms based on graphs and constraint information. Data points under a cannot-link constraint are placed directly in different clusters when the graph is divided, and are not merged when subgraphs are merged. Data points under a must-link constraint are not divided into different clusters when the graph is divided, and are merged into one cluster when subgraphs are combined. This is the main idea of the semi-supervised selective clustering ensemble algorithms.
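One simple way to picture how pairwise constraints can modify a similarity matrix is sketched below. The exact update rule used by SSCEC and SSCEN is not given in this excerpt, so the rule here is an assumption: must-link pairs are set to the maximum similarity 1.0 and cannot-link pairs to 0.0. The helper names `apply_constraints` and `violations`, and the toy matrix, are hypothetical.

```python
def apply_constraints(sim, must_link, cannot_link):
    """Return a copy of sim with constrained pairs forced to 1.0 or 0.0.

    Assumed rule for illustration only; the paper's update may differ.
    """
    adjusted = [row[:] for row in sim]
    for i, j in must_link:
        adjusted[i][j] = adjusted[j][i] = 1.0
    for i, j in cannot_link:
        adjusted[i][j] = adjusted[j][i] = 0.0
    return adjusted

def violations(labels, must_link, cannot_link):
    """Count the constraints that a partition fails to satisfy."""
    broken = sum(1 for i, j in must_link if labels[i] != labels[j])
    broken += sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return broken

# Hypothetical 4-point similarity matrix and constraints.
sim = [
    [1.0, 0.4, 0.3, 0.0],
    [0.4, 1.0, 0.7, 0.1],
    [0.3, 0.7, 1.0, 0.6],
    [0.0, 0.1, 0.6, 1.0],
]
must_link = [(0, 1)]
cannot_link = [(1, 2)]
adjusted = apply_constraints(sim, must_link, cannot_link)
```

A graph-based consensus function that operates on `adjusted` instead of `sim` is then pushed toward keeping points 0 and 1 together and separating points 1 and 2, and `violations` gives a quick check of whether a resulting partition respects the constraints.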

The main advantages and contributions of this paper are as follows:

(1) We propose two semi-supervised selective clustering ensemble algorithms based on multiple clustering member selection strategies.

(2) Four clustering member selection strategies are applied to provide diverse solutions and guarantee the quality of selected solutions.

(3) The constraint information is utilized twice, in the construction of the similarity matrix and in the consensus function stage, which improves the performance of the proposed algorithms.

The rest of this paper is organized as follows. Section 2 reviews research on constraint information and semi-supervised clustering ensemble. Section 3 proposes two semi-supervised selective clustering ensemble algorithms based on constraint information. Section 4 compares the two algorithms with other leading semi-supervised clustering ensemble algorithms to illustrate their advantages. Section 5 summarizes the work of this paper and outlines future research directions.

Section snippets

Constraint information

Users with a professional background often want to apply their knowledge to clustering ensemble. This kind of knowledge is called constraint information in clustering ensemble. Ma et al. proposed a novel combinatorial term weighting scheme, CmTLB [8], based on the term weighting scheme, which was combined with the application of sentiment analysis [9]. It can increase the diversity of user constraint information used for clustering. In particular, training samples can be generated using sentiment-based

Semi-supervised selective clustering ensemble algorithms

Traditional clustering ensemble does not consider how to use the constraint information given by experts. Existing semi-supervised clustering ensemble algorithms generally use constraint information in the generative mechanism, where a semi-supervised clustering algorithm generates the clustering members. Cluster ensemble algorithms with member selection use constraint information in member selection. However, using constraint information in generative mechanism and member selection will

Experiments

The experiments are performed on the 10 UCI data sets [40] listed in Table 1. These UCI data sets come from different fields such as medicine, life, society, and physics. They differ in the amount of data, dimensions, and number of clusters to reflect the strengths and weaknesses of our algorithms. It is worth noting that all data sets are marked with supervised classification information. This label information is not only used to evaluate our algorithms, but also to generate constraint information

Conclusion

In recent years, ensemble research has been a hot topic in data mining, and with the development of Internet technology, it has broader application prospects. In this paper, the prior knowledge given by experts in real life is used as constraint information to guide the consensus function. SSCEC and SSCEN are proposed based on the Chameleon algorithm and the Ncut algorithm, respectively, and analyzed on UCI data sets. Experiments show that the constraint information does have a guiding effect on

CRediT authorship contribution statement

Tinghuai Ma: Supervision, Conceptualization, Funding acquisition. Zheng Zhang: Methodology. Lei Guo: Methodology, Software, Writing - original draft. Xin Wang: Conceptualization, Visualization, Validation, Writing - review & editing. Yurong Qian: Writing - review & editing. Najla Al-Nabhan: Writing - review & editing, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported in part by the National Key Research and Development Program of China (2021YFE0104400, 2018YFC1507805). The authors extend their appreciation to the Deanship of Scientific Research at King Saud University for funding this work through research group No. RGP-1441-33.

Tinghuai Ma received his Bachelor's degree (HUST, China, 1997), Master's degree (HUST, China, 2000), and PhD (Chinese Academy of Science, 2003), and was a post-doctoral associate (AJOU University, 2004). He is now a professor of Computer Sciences at Nanjing University of Information Science & Technology, China. His research interests are data mining, social networks, privacy preserving, data sharing, etc.

References (43)

L. Bai et al., A multiple k-means clustering ensemble algorithm to find nonlinearly separable clusters, Information Fusion (2020)
X. Zhao et al., Clustering ensemble selection for categorical data based on internal validity indices, Pattern Recognition (2017)
H. Rong et al., Deep rolling: A novel emotion prediction model for a multi-participant communication context, Information Sciences (2019)
T. Ma et al., Graph classification based on graph set reconstruction and graph kernel feature reduction, Neurocomputing (2018)
T. Ma et al., Natural disaster topic extraction in sina microblogging based on graph analysis, Expert Systems with Applications (2019)
T. Ma et al., Lgiem: Global and local node influence based community detection, Future Generation Computer Systems (2020)
H. Wang et al., Constraint neighborhood projections for semi-supervised clustering, IEEE Transactions on Cybernetics (2014)
J. Zhou et al., Ensemble clustering based on dense representation, Neurocomputing (2019)
F. Yang et al., Cluster ensemble selection with constraints, Neurocomputing (2017)
X. Sevillano et al., Parallel hierarchical architectures for efficient consensus clustering on big multimedia cluster ensembles, Information Sciences (2020)
C. Han et al., Ensemble clustering based on evidence theory

Zheng Zhang received a Bachelor's degree in Computer Science & Technology from Nanjing University of Information Science & Technology, China, in 2018. He is currently a PhD candidate at Nanjing University of Information Science & Technology. His research interests are machine learning and NLP tasks.

Lei Guo received a Bachelor's degree in Computer Science & Technology from Nanjing University of Information Science & Technology, China, in 2018. He is currently a PhD candidate at Nanjing University of Information Science & Technology. His research interest is community detection.

Yurong Qian is a professor in the School of Software, Xinjiang University, China. She received her BS and MS degrees in computer science and technology from Xinjiang University (2002 and 2005), and her PhD in biology from Nanjing University (2010), China. From 2012 to 2013, she worked as a postdoctoral fellow in the Department of Electronics and Computer Engineering, Hanyang University, South Korea. Her research interests include cloud computing, image processing, and intelligent computation such as artificial neural networks.

Najla Al-Nabhan received her BS in Computer Applications (Hon) and MS in Computer Science, both from The George Washington University, in 2005 and 2008 respectively. In 2013, she received her Ph.D. in Computer Science from King Saud University. She is currently working as the Vic Assistant professor at the Computer Science Department, College of Computer and Information Sciences (CCIS), King Saud University (KSU), Riyadh, Saudi Arabia. Her current research interests include wireless sensor networks, multimedia sensor networks, cognitive networks, and network security.
