Improving constrained clustering via swarm intelligence
Introduction
Clustering is one of the most important data analysis methods in machine learning and data mining. Its purpose is to partition a dataset into meaningful groups. Clustering algorithms are usually regarded as unsupervised learning because they do not use a target attribute during the learning process. However, in many real-world applications, some prior knowledge or domain information about the problem at hand is available. How to exploit these limited but important constraints has attracted increasing attention in recent years [3].
Swarm intelligence [7] is a popular research area in artificial intelligence. It draws on the collective behavior of social insects, which live in groups and exhibit a high degree of intelligence [13]. An ant colony can accomplish tasks that a single ant cannot, much as in human society. Inspired by behaviors such as breeding, foraging, nest construction, garbage collection, and territory defense performed by ants and other social insects, researchers have designed a series of algorithms that have been successfully applied to function optimization, combinatorial optimization, robotics, and other areas.
Some researchers have used artificial ants to tackle clustering problems and have achieved remarkable results. The earliest model, proposed by Deneubourg [2] and often called the basic model (BM), was designed to explain how ants pile dead bodies together to form an ant cemetery. Its main idea is to pick up bodies in sparse areas and drop them at places where there are more bodies of the same type. By introducing real data vectors and a similarity measure between data objects, Lumer and Faieta modified the BM model; the result, often called LF [2], became the de facto ant-based clustering model. He and Hui used ant-based clustering (Ant-C) [9] algorithms to analyze gene expression data. El-Feghi et al. presented the AACA [10] algorithm, which takes the properties of aggregation pheromone and perception of the environment into account to improve the convergence rate. Mohamed Jafar Abul Hasan surveyed the evolution of clustering based on swarm intelligence [11]. Han and Shi proposed an improved ant colony algorithm for fuzzy clustering in image segmentation [12]. Lutz Herrmann and Alfred Ultsch proposed an artificial life system [4] based on ESM [5], [8] to deal with the clustering problem. Recently, Xu et al. presented an ant sleeping model [6] to improve the performance of ant-based clustering.
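The pick-up and drop rule of the basic model can be illustrated with the probability functions commonly attributed to Deneubourg's model in the ant-clustering literature. The text above does not state these formulas, so the sketch below is an assumption about the standard formulation rather than a quotation from this paper; the function names and the constants `k1` and `k2` are illustrative choices. Here `f` stands for the locally perceived fraction of similar bodies.

```python
def pick_probability(f: float, k1: float = 0.1) -> float:
    """Probability that an unladen ant picks up a body.

    f  -- perceived fraction of similar bodies nearby (0 = isolated).
    k1 -- threshold constant (illustrative value, an assumption).
    High when the body sits in a sparse area, low in a dense pile.
    """
    return (k1 / (k1 + f)) ** 2


def drop_probability(f: float, k2: float = 0.3) -> float:
    """Probability that a laden ant drops the body it carries.

    k2 -- threshold constant (illustrative value, an assumption).
    Low in sparse areas, high where many similar bodies already lie.
    """
    return (f / (k2 + f)) ** 2


if __name__ == "__main__":
    for f in (0.0, 0.1, 0.5, 0.9):
        print(f, pick_probability(f), drop_probability(f))
```

With these forms, an isolated body (`f = 0`) is always picked up and never dropped, which matches the idea of moving bodies from sparse areas toward existing piles.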
In our previous work, we proposed a new ant clustering framework, RWAC (Random Walk Ant Clustering). In traditional ant-based clustering algorithms, the ants pick up and drop data objects to form clusters. RWAC differs in that each ant itself represents a data point, which is simpler and more direct. The ants walk randomly on the grid to find a place where they feel fit enough to sleep. They perceive the fitness of their neighborhood to decide their action: stay asleep or wake up and leave. In this paper, building on RWAC, we integrate a heuristic walk mechanism to accelerate its convergence. The method extends easily to semi-supervised learning when domain knowledge is provided in the form of pairwise constraints; hence we call it CAC (Constrained Ant Clustering). CAC is a simple and effective ant-based semi-supervised clustering algorithm.
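The sleep-or-walk behavior described above can be sketched as follows. This is only a toy illustration under stated assumptions, not the RWAC algorithm itself: the fitness definition (mean similarity `1 / (1 + distance)` to ants in a square neighborhood), the fixed `threshold`, the bounded grid, and the names `fitness` and `step` are all hypothetical choices, since the text does not specify them. Several ants may share a cell in this simplification.

```python
import math
import random


def fitness(ant, positions, data, radius=1):
    """Mean similarity between an ant and the ants in its grid neighborhood.

    positions -- dict: ant id -> (x, y) grid cell.
    data      -- dict: ant id -> feature vector of the data point it represents.
    Similarity 1 / (1 + Euclidean distance) is an assumed, illustrative choice.
    """
    x, y = positions[ant]
    sims = []
    for other, (ox, oy) in positions.items():
        if other != ant and abs(ox - x) <= radius and abs(oy - y) <= radius:
            sims.append(1.0 / (1.0 + math.dist(data[ant], data[other])))
    return sum(sims) / len(sims) if sims else 0.0


def step(positions, data, grid, threshold=0.5, rng=random):
    """One sweep: ants in an unfit neighborhood wake up and take a random step."""
    for ant in positions:
        if fitness(ant, positions, data) < threshold:
            x, y = positions[ant]
            dx, dy = rng.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
            # Clamp to the grid so ants never leave it.
            nx = min(max(x + dx, 0), grid - 1)
            ny = min(max(y + dy, 0), grid - 1)
            positions[ant] = (nx, ny)
    return positions


if __name__ == "__main__":
    data = {0: (0.0, 0.0), 1: (0.1, 0.0), 2: (5.0, 5.0)}
    positions = {0: (2, 2), 1: (2, 3), 2: (0, 0)}
    # Ants 0 and 1 are similar neighbors and stay asleep; ant 2 keeps walking.
    print(step(positions, data, grid=5))
```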
Section snippets
Constrained clustering
Cluster analysis, or clustering, is the assignment of a set of data points to subsets (called clusters) so that intra-cluster similarity is high and inter-cluster similarity is low. In real applications, a small amount of domain information, such as labels or constraints, can often be obtained during cluster analysis. Constraints stating that two data points must be assigned to the same cluster or to different clusters are sometimes easy to obtain. Utilizing the pairwise constraints can
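As an illustration of pairwise constraints, the sketch below represents them as index pairs and counts how many a given cluster assignment breaks. The terms "must-link" and "cannot-link" are the names commonly used in the constrained clustering literature for same-cluster and different-cluster constraints; the function name and the sample data are hypothetical.

```python
from typing import Iterable, Sequence, Tuple

Pair = Tuple[int, int]


def count_violations(labels: Sequence[int],
                     must_link: Iterable[Pair],
                     cannot_link: Iterable[Pair]) -> int:
    """Count the pairwise constraints that a cluster assignment breaks."""
    violations = 0
    for i, j in must_link:
        # A must-link pair is violated when the points land in different clusters.
        if labels[i] != labels[j]:
            violations += 1
    for i, j in cannot_link:
        # A cannot-link pair is violated when the points share a cluster.
        if labels[i] == labels[j]:
            violations += 1
    return violations


if __name__ == "__main__":
    must_link = [(0, 1), (1, 2)]
    cannot_link = [(0, 3), (2, 3)]
    # Breaks must-link (1, 2) and cannot-link (2, 3).
    print(count_violations([0, 0, 1, 1], must_link, cannot_link))  # prints 2
    # Satisfies every constraint.
    print(count_violations([0, 0, 0, 1], must_link, cannot_link))  # prints 0
```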
Ant-based clustering models
This section briefly describes the first ant-based clustering model, the BM framework, outlining its principles and operations. Building on BM, we then introduce the RWAC clustering framework as an improvement that simulates the behavior of social groups more directly. Finally, we present the main idea of RWAC and its algorithm framework.
CAC framework
In this section, we propose a constrained ant clustering framework. It integrates a heuristic walk mechanism to accelerate the convergence of RWAC and, with little effort, can be extended to handle constrained clustering. First, we describe the heuristic walk mechanism in detail. Then we present the whole algorithm framework.
Experiments
In this section, we first compare the CAC framework without any constraints against both Kmeans and RWAC. Then, for each data set, we randomly generate different numbers of pairwise constraints and compare RWAC with CAC and COP-Kmeans [1], a state-of-the-art constrained clustering algorithm that adapts Kmeans to satisfy the provided constraints. Three evaluation criteria (purity, F-measure, and Rand index) are used to evaluate the performance of the algorithms. The experiments on 7 UCI
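Two of the evaluation criteria named above, purity and the Rand index, have standard definitions that can be sketched briefly. The code below follows those textbook definitions and is not taken from the paper; the function names and the sample labels are hypothetical. The F-measure is left out because several variants exist and the text does not say which one is used.

```python
from collections import Counter
from itertools import combinations
from typing import Sequence


def purity(truth: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of points that belong to the majority class of their cluster."""
    clusters = {}
    for t, p in zip(truth, pred):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(truth)


def rand_index(truth: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of point pairs on which the two partitions agree.

    A pair agrees when both partitions put it in the same group,
    or both put it in different groups.
    """
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] == truth[j]) == (pred[i] == pred[j])
                for i, j in pairs)
    return agree / len(pairs)


if __name__ == "__main__":
    truth = [0, 0, 0, 1, 1, 1]
    pred = [0, 0, 1, 1, 1, 1]
    print(purity(truth, pred))      # 5 of 6 points match their cluster's majority
    print(rand_index(truth, pred))  # 10 of 15 pairs agree
```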
Conclusions
In this paper, we propose a constrained ant clustering framework that embeds a heuristic walk mechanism and handles domain knowledge provided in the form of pairwise constraints. The experimental results show that our CAC framework outperforms the RWAC and COP-Kmeans algorithms on both an artificial dataset and real-world UCI datasets.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under grant No. 61003180, No. 61070047 and No. 61103018; Natural Science Foundation of Education Department of Jiangsu Province under contract 09KJB20013; Natural Science Foundation of Jiangsu Province under contract BK2010318, BK2011442 and BK2012128; Research Innovation Program for College graduates of Jiangsu Province (CXLX12_0917); The New Century Talent Project of Yangzhou University.
References (15)
- et al., Ant algorithms and stigmergy, Future Gener. Comput. Syst. (2000)
- et al., An improved ant colony algorithm for fuzzy clustering in image segmentation, Neurocomputing (2007)
- K. Wagstaff, C. Cardie, S. Rogers, S. Schroedl, Constrained k-means clustering with background knowledge, in:...
- X. Zhu, Semi-supervised Learning with Graphs, Doctoral Dissertation, Carnegie Mellon University, CMU-LTI-05-192,...
- L. Herrmann, A. Ultsch, An artificial life approach for semi-supervised Learning. Data analysis, machine learning and...
- A. Ultsch, L. Herrmann, Automatic Clustering with U*C. Technical Report, Department of Mathematics and Computer...
- et al., A novel ant clustering algorithm based on cellular automata, Web Intell. Agent Syst. (2007)
Xiaohua Xu received his Ph.D. degree in computer science from Nanjing University of Aeronautics and Astronautics of China in 2008, and M.S. degree from Yangzhou University of China in 2005. His research interests include machine learning, evolutionary computation, and parallel algorithms.
Lin Lu received his B.S. degree in 2011 at Yangzhou University. He is currently pursuing the M.S. degree at Yangzhou University. His research interests include swarm intelligence and machine learning.
Ping He received her M.S. degree from Yangzhou University of China in 2008. She is currently pursuing the Ph.D. degree at Nanjing University of Aeronautics and Astronautics of China. Her research interests include machine learning, data mining and bioinformatics.
Zhoujin Pan received his B.S. degree in 2009 at Yangzhou University. He is currently pursuing the M.S. degree at Yangzhou University. His research interests include swarm intelligence and machine learning.
Ling Chen is a professor in the Computer Science Department at Yangzhou University, Yangzhou, P.R. China. He did research for two years on parallel algorithms and architectures at the University of Pittsburgh, PA, first as a visiting scholar in 1986 and, later, as a visiting associate professor in 1992. His research interests include parallel algorithm design, artificial intelligence, and bioinformatics. Professor Chen is a member of IEEE Computer Society and Chinese Computer Society.