Semi-supervised Selective Clustering Ensemble based on constraint information
Introduction
Clustering analysis plays an important role in many fields, such as pattern recognition, image processing, electronic commerce, document clustering, data analysis, and customer recommendation, and has become a common data mining method. Clustering analysis divides data into clusters according to a given criterion. The goal is that data in the same cluster are highly similar, while data in different clusters are dissimilar [1], [2]. Unlike classification, clustering is unsupervised learning, because the input data do not carry their true class labels. Clustering therefore uses a similarity measure, such as Euclidean distance, to divide the data into clusters according to their attributes or intrinsic features, with the goal of minimizing the distance between data in the same cluster and maximizing the distance between data in different clusters [3].
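As a small illustration of this objective, the sketch below compares the mean Euclidean distance within clusters to the mean distance between clusters for a given partition. The function name `mean_distances` and the toy points are assumptions made for illustration only; they are not taken from the paper.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+


def mean_distances(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        # A pair counts as "intra" when both points share a cluster label.
        (intra if labels[i] == labels[j] else inter).append(d)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical toy data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = [0, 0, 1, 1]
intra, inter = mean_distances(points, labels)
print(intra, inter)
```

A good partition in the sense described above yields a small first value and a large second value.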
Clustering has been applied in many areas, and as the field has developed, research on clustering algorithms has proliferated. Given the growing number of clustering algorithms and data sets, it is difficult to find a single clustering algorithm that suits all data sets [4]. Therefore, Strehl and Ghosh proposed the concept of clustering ensemble [5]. Suppose a data set X contains n data points, and M clustering algorithms are applied to X to obtain M partitions of X. These partitions form the set of clustering members, whose i-th element is the partition obtained with the i-th clustering algorithm. The consensus function then merges these clustering members to produce the final partition. The process is shown in Fig. 1.
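The ensemble process can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: it uses a co-association matrix and a simple thresholded connected-components step as the consensus function, which is a common textbook choice and not the Chameleon-based or Ncut-based consensus functions of this paper. The names `co_association` and `consensus`, the threshold of 0.5, and the three toy members are hypothetical.

```python
def co_association(partitions):
    """Fraction of clustering members in which each pair of points shares a cluster."""
    n = len(partitions[0])
    m = len(partitions)
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def consensus(sim, threshold=0.5):
    """Toy consensus function: connected components of the graph that links
    every pair whose co-association exceeds the threshold."""
    n = len(sim)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Three hypothetical clustering members (M = 3) over n = 5 data points.
members = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
sim = co_association(members)
final = consensus(sim)
print(final)
```

Note that the first two members describe the same partition with different label names; the co-association matrix is insensitive to such relabeling, which is one reason it is a convenient intermediate representation.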
Clustering ensemble can make full use of the partitioning information from all of the single clustering algorithms, so that the final result is better than that of any single clustering algorithm [6]. Each clustering algorithm has its own advantages and disadvantages, which makes it applicable only to specific data sets. Clustering ensemble combines the results of these algorithms to avoid the disadvantages of a single clustering algorithm, which makes it suitable for more data sets. In addition, clustering ensemble is robust to noise and outliers [7]. Because of these advantages, clustering ensemble has been applied to medical science, image processing, and synoptic science.
Existing clustering ensemble methods that use constraint information fall into three main types. The first uses constraint information in member generation, so that each generated member conforms to the given constraints. The second uses constraint information in member selection, choosing members according to the constraints that each member satisfies. The third uses constraint information in the consensus function. When constraint information is used in the generation mechanism or the member selection stage, the diversity of members is reduced, which degrades the result of clustering fusion. In this paper, the members in the selected subset are transformed into a similarity matrix, which is then refined according to the constraint information. Because the consensus functions chosen in this paper are graph-based, using the prior-knowledge constraints at the consensus function stage is also more intuitive and convenient. Having compared the advantages and disadvantages of using constraint information at different stages, this paper uses it in the consensus function. Combining the characteristics of the Chameleon algorithm and the normalized cut (Ncut) algorithm, this paper proposes two semi-supervised selective clustering ensemble algorithms based on graphs and constraint information: SSCEC, based on Chameleon, and SSCEN, based on Ncut. Their consensus functions first modify the similarity matrix, which describes the relationships between data, according to the constraint information, and then use the constraint information again when Chameleon or Ncut processes the graph. SSCEC and SSCEN thus use the constraint information twice, which greatly improves the performance of the algorithms.
Data pairs that the constraints require to be separated are placed in different clusters when the graph is divided and are not merged into one cluster when graphs are combined, while data pairs that the constraints require to stay together are not divided into different clusters. This is the main idea of the proposed semi-supervised selective clustering ensemble algorithms.
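The idea of refining a similarity matrix with pairwise constraints can be sketched as follows. This is only an illustration under assumptions: the terms must-link and cannot-link, the rule of setting constrained entries to 1 or 0, and the function names `apply_constraints` and `count_violations` are common conventions chosen here, not the update rule defined in this paper.

```python
def apply_constraints(sim, must_link, cannot_link):
    """Return a copy of a symmetric similarity matrix refined by pairwise constraints."""
    adjusted = [row[:] for row in sim]
    for i, j in must_link:
        # Assumed rule: pairs that must stay together get maximal similarity.
        adjusted[i][j] = adjusted[j][i] = 1.0
    for i, j in cannot_link:
        # Assumed rule: pairs that must be separated get zero similarity.
        adjusted[i][j] = adjusted[j][i] = 0.0
    return adjusted


def count_violations(labels, must_link, cannot_link):
    """Count the constraints that a partition fails to satisfy."""
    broken_ml = sum(1 for i, j in must_link if labels[i] != labels[j])
    broken_cl = sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return broken_ml + broken_cl


# Hypothetical similarity matrix for four data points.
sim = [
    [1.0, 0.8, 0.4, 0.1],
    [0.8, 1.0, 0.6, 0.2],
    [0.4, 0.6, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
]
must_link = [(0, 2)]
cannot_link = [(1, 2)]
adjusted = apply_constraints(sim, must_link, cannot_link)
violations = count_violations([0, 0, 1, 1], must_link, cannot_link)
print(adjusted[0][2], adjusted[1][2], violations)
```

A graph-based consensus function would then operate on `adjusted` instead of `sim`, and a check such as `count_violations` can confirm whether the final partition still respects the constraints.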
The main advantages and contributions of this paper are as follows:
(1) We propose two semi-supervised selective clustering ensemble algorithms based on multiple clustering member selection strategies.
(2) Four clustering member selection strategies are applied to provide diverse solutions and guarantee the quality of selected solutions.
(3) The constraint information is used twice, in the construction of the similarity matrix and in the consensus function stage, which improves the performance of the proposed algorithms.
The remainder of this paper is organized as follows. Section 2 reviews related work on constraint information and semi-supervised clustering ensemble. Section 3 proposes two semi-supervised selective clustering ensemble algorithms based on constraint information. Section 4 compares the two algorithms with other competitive semi-supervised clustering ensemble algorithms to illustrate their advantages. Section 5 summarizes the work of this paper and outlines future research directions.
Section snippets
Constraint information
Users with a professional background often want to apply their domain knowledge to clustering ensemble. This kind of knowledge is called constraint information in clustering ensemble. Ma et al. proposed CmTLB [8], a novel combinatorial term weighting scheme built on existing term weighting schemes, and combined it with an application of sentiment analysis [9]. It can increase the diversity of user constraint information used for clustering. In particular, training samples can be generated using sentiment-based
Semi-supervised selective clustering ensemble algorithms
Traditional clustering ensemble does not consider how to use the constraint information given by experts. Existing semi-supervised clustering ensemble algorithms generally use constraint information in the generation mechanism, where a semi-supervised clustering algorithm generates the clustering members. Clustering ensemble algorithms with member selection use constraint information in member selection. However, using constraint information in generative mechanism and member selection will
Experiments
The experiments are performed on the 10 UCI data sets [40] listed in Table 1. These UCI data sets come from different fields such as medicine, life, society, and physics. They differ in the amount of data, the number of dimensions, and the number of clusters, so as to reflect the strengths and weaknesses of our algorithms. It is worth noting that all data sets are labeled with supervised classification information. This label information is not only used to evaluate our algorithms, but also to generate constraint information
Conclusion
In recent years, ensemble research has been a hot topic in data mining, and with the development of Internet technology, it has broader application prospects. In this paper, the prior knowledge given by experts in real life is used as the constraint information to guide the consensus function. SSCEC and SSCEN are proposed based on the Chameleon algorithm and the Ncut algorithm, respectively, and are analyzed on UCI data sets. Experiments show that the constraint information does have a guiding effect on
CRediT authorship contribution statement
Tinghuai Ma: Supervision, Conceptualization, Funding acquisition. Zheng Zhang: Methodology. Lei Guo: Methodology, Software, Writing - original draft. Xin Wang: Conceptualization, Visualization, Validation, Writing - review & editing. Yurong Qian: Writing - review & editing. Najla Al-Nabhan: Writing - review & editing, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was supported in part by the National Key Research and Development Program of China (2021YFE0104400, 2018YFC1507805). The authors extend their appreciation to the Deanship of Scientific Research at King Saud University for funding this work through research group No. RGP-1441-33.
References (43)
- et al. A multiple k-means clustering ensemble algorithm to find nonlinearly separable clusters. Information Fusion (2020)
- et al. Clustering ensemble selection for categorical data based on internal validity indices. Pattern Recognition (2017)
- et al. Deep rolling: A novel emotion prediction model for a multi-participant communication context. Information Sciences (2019)
- et al. Graph classification based on graph set reconstruction and graph kernel feature reduction. Neurocomputing (2018)
- et al. Natural disaster topic extraction in Sina microblogging based on graph analysis. Expert Systems with Applications (2019)
- et al. LGIEM: Global and local node influence based community detection. Future Generation Computer Systems (2020)
- et al. Constraint neighborhood projections for semi-supervised clustering. IEEE Transactions on Cybernetics (2014)
- et al. Ensemble clustering based on dense representation. Neurocomputing (2019)
- et al. Cluster ensemble selection with constraints. Neurocomputing (2017)
- et al. Parallel hierarchical architectures for efficient consensus clustering on big multimedia cluster ensembles. Information Sciences (2020)
- Synergetic information bottleneck for joint multi-view and ensemble clustering. Information Fusion
- Spectral co-clustering ensemble. Knowledge-Based Systems
- Clustering ensemble based on sample’s stability. Artificial Intelligence
- An integrated k-means–Laplacian cluster ensemble approach for document datasets. Neurocomputing
- Hierarchical cluster ensemble model based on knowledge granulation. Knowledge-Based Systems
- On-line relational and multiple relational SOM. Neurocomputing
- Semi-supervised hierarchical clustering ensemble and its application. Neurocomputing
- Iterative ensemble normalized cuts. Pattern Recognition
- Incremental fuzzy cluster ensemble learning based on rough set theory. Knowledge-Based Systems
- Semi-supervised evolutionary ensembles for web video categorization. Knowledge-Based Systems
- Ensemble clustering based on evidence theory
Cited by (6)
A semi-supervised hierarchical ensemble clustering framework based on a novel similarity metric and stratified feature sampling
2023, Journal of King Saud University - Computer and Information Sciences
Semi-supervised hierarchical ensemble clustering based on an innovative distance metric and constraint information
2023, Engineering Applications of Artificial Intelligence
An ensemble hierarchical clustering algorithm based on merits at cluster and partition levels
2023, Pattern Recognition
Big data analysis using a parallel ensemble clustering architecture and an unsupervised feature selection approach
2023, Journal of King Saud University - Computer and Information Sciences
Citation Excerpt: Eventually, the final clusters are generated from the consensus matrix based on the normalized cut algorithm. Ma et al. (2021) developed an ensemble clustering method based on constraint information. The authors propose two ensemble clustering approaches: Semi-supervised Selective Clustering Ensemble based on Ncut (SSCEN) and Semi-supervised Selective Clustering Ensemble based on Chameleon (SSCEC).
A Hybrid Clustering Method Based on the Several Diverse Basic Clustering and Meta-Clustering Aggregation Technique
2024, Cybernetics and Systems
Double-Constrained Consensus Clustering with Application to Online Anti-Counterfeiting
2023, Applied Sciences (Switzerland)
Tinghuai Ma received his Bachelor (HUST, China, 1997), Master (HUST, China, 2000), PhD (Chinese Academy of Science, 2003) and was Post-doctoral associate (AJOU University, 2004). Now, he is a professor in Computer Sciences at Nanjing University of Information Science & Technology, China. His research interests are data mining, social network, privacy preserving, data sharing etc.
Zheng Zhang received a Bachelor's degree in Computer Science & Technology from Nanjing University of Information Science & Technology, China in 2018. Currently, he is a PhD candidate at Nanjing University of Information Science & Technology. His research interests are machine learning and NLP tasks.
Lei Guo received a Bachelor's degree in Computer Science & Technology from Nanjing University of Information Science & Technology, China in 2018. Currently, he is a PhD candidate at Nanjing University of Information Science & Technology. His research interest is community detection.
Yurong Qian is a professor in the School of Software, Xinjiang University, China. She received her BS and MS degrees in computer science and technology from Xinjiang University (2002 and 2005), and her PhD in biology from Nanjing University (2010), China. From 2012 to 2013, she worked as a postdoctoral fellow in the Department of Electronics and Computer Engineering, Hanyang University, South Korea. Her research interests include cloud computing, image processing, and intelligent computation such as artificial neural networks.
Najla Al-Nabhan received her BS in Computer Applications (Hon) and MS in Computer Science, both from The George Washington University, in 2005 and 2008, respectively. In 2013, she received her Ph.D. in Computer Science from King Saud University. She is currently working as the Vic Assistant professor at the Computer Science Department, College of Computer and Information Sciences (CCIS), King Saud University (KSU), Riyadh, Saudi Arabia. Her current research interests include Wireless Sensor Networks, Multimedia Sensor Networks, Cognitive Networks, and Network Security.