A multi-prototype clustering algorithm
Introduction
Clustering is the unsupervised classification of patterns into groups [1]. It is widely used in data analysis tasks such as data mining, pattern recognition and information retrieval. The Voronoi diagram provides a means of naturally partitioning space into subregions to facilitate spatial data analysis, and has been applied to data clustering [2], [3], [4], [5]. However, this technique emphasizes the shape and arrangement of patterns, i.e., the geometric aspect of groups. Clustering techniques have been widely studied in Refs. [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19]. They focus more on grouping behavior and can be broadly classified into hierarchical and partitional clustering [1].
Hierarchical clustering transforms the proximity matrix of the data set into a sequence of nested groups in an agglomerative or divisive manner. Agglomerative hierarchical clustering has been widely studied as it allows more feasible segments to be investigated [7], [8], [9], [10], [11]. The single-link [7], complete-link [8] and average-link [7] algorithms produce a sequence of clusterings based on the rank order of proximities. The single-link and complete-link algorithms use the distance between the two closest and the two farthest points of two clusters as the cluster distance, respectively. Dependence on a few data points to measure the cluster distance makes these algorithms sensitive to noise. The average-link algorithm is more robust to noise because it uses the average distance over all pairs of patterns from different clusters as the cluster distance. The CURE algorithm [9] represents each cluster with a fixed number of well-scattered points and shrinks these points toward the cluster center by a specified fraction, which improves noise robustness over the single-link algorithm. The Chameleon algorithm [10] partitions a constructed k-nearest-neighbor graph into a number of subclusters and then dynamically merges them. In general, hierarchical clustering algorithms provide an easy understanding of the inherent structure of the data set, but they often require high computational cost and large memory space, which makes them inefficient for large data sets.
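The three linkage distances above can be stated compactly. The following is a minimal sketch in Python (Euclidean distance and the function names are assumptions for illustration, not code from the paper):

```python
import math
from itertools import product

def single_link(A, B):
    # distance between the two closest points of clusters A and B
    return min(math.dist(a, b) for a, b in product(A, B))

def complete_link(A, B):
    # distance between the two farthest points of clusters A and B
    return max(math.dist(a, b) for a, b in product(A, B))

def average_link(A, B):
    # average distance over all cross-cluster pairs; averaging over every
    # pair makes this measure less sensitive to individual noisy points
    return sum(math.dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))
```

Because single-link and complete-link each depend on a single extreme pair, one outlier can change the cluster distance arbitrarily, whereas the average-link value moves only by that pair's share of the sum.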
Partitional clustering produces a single partition of the data set that aims to optimize a certain cluster criterion function. Many partitional clustering algorithms have been proposed based on different cluster criteria [20], [21], [22], [23], [24], [25], [26]. In fact, each cluster criterion imposes a certain structure on the data set. Model-based clustering algorithms assume that the data distribution of a cluster fits a given probability density model such as the Gaussian model [20], [21]. They can discover hyper-ellipsoidal clusters, but the assumption of a static model prevents them from adequately capturing the characteristics of individual clusters, especially when the data set contains clusters of diverse shapes and densities. Some nonparametric clustering algorithms based on density and grids identify clusters by searching for regions of high data density separated by sparse valleys [22], [23], [24]. Although these algorithms can find clusters of arbitrary shape, their performance usually degrades on high-dimensional data sets. The squared-error clustering algorithm is based on the squared-error criterion [1], [25], [30]. It tends to work well with compact clusters of hyper-spherical shape and similar size, and is widely studied and used [25], [26], [27], [28], [29], [30], [31]. New distance measures have been proposed to detect clusters with specific characteristics [28], [29]. Besides the squared error, other criteria such as the Davies–Bouldin index [32] and cluster variance are imposed as global criteria to determine the optimum number of clusters [26], [31]. Most partitional clustering algorithms require less memory space and computation cost than hierarchical clustering algorithms, but their clustering results are usually not as good as those of hierarchical clustering.
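The Davies–Bouldin index [32] mentioned above penalizes cluster pairs whose combined intra-cluster scatter is large relative to the distance between their centroids; lower values indicate better-separated, more compact clusters. A minimal sketch (helper names `centroid` and `scatter` are illustrative; Euclidean distance is assumed):

```python
import math

def centroid(cluster):
    # componentwise mean of the cluster's points
    d = len(cluster[0])
    return tuple(sum(p[k] for p in cluster) / len(cluster) for k in range(d))

def scatter(cluster, c):
    # average distance of cluster members to their centroid
    return sum(math.dist(p, c) for p in cluster) / len(cluster)

def davies_bouldin(clusters):
    # DB = (1/K) * sum_i max_{j != i} (S_i + S_j) / d(c_i, c_j)
    cents = [centroid(cl) for cl in clusters]
    scats = [scatter(cl, c) for cl, c in zip(clusters, cents)]
    K = len(clusters)
    total = 0.0
    for i in range(K):
        total += max((scats[i] + scats[j]) / math.dist(cents[i], cents[j])
                     for j in range(K) if j != i)
    return total / K
```

Evaluating this index over candidate partitions with different numbers of clusters is one way such a global criterion can select the optimum cluster count.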
Recently, support vector clustering has been proposed to generate cluster boundaries of arbitrary shape by transforming the original space into a high-dimensional space with a kernel function [19]. Although this algorithm can solve some difficult clustering problems, choosing a suitable kernel parameter is not easy, and the clustering result provides no information about the representation of the clusters.
Hybrid clustering algorithms have been proposed to combine the merits of partitional and hierarchical clustering for better data grouping [12], [13], [14], [15], [33], [34]. They usually partition the data set into a relatively large number of small subclusters and then construct a hierarchical structure over them based on a certain cluster distance (similarity) measure. A given number of clusters can then be extracted from the hierarchical structure. The BIRCH algorithm [12] arranges the data set into a number of subclusters represented by cluster feature (CF) vectors in a tree structure, and is efficient for large data sets.
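The two-stage scheme described above can be sketched in a few lines: first over-partition the data with a standard squared-error method (Lloyd's k-means here), then agglomeratively merge the resulting subclusters. This is a generic illustration of the hybrid idea, not any specific algorithm from [12], [13], [14], [15]; centroid distance is assumed as the merge criterion:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    # Stage 1: Lloyd's algorithm produces k small subclusters
    rng = random.Random(seed)
    cents = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: math.dist(p, cents[i]))].append(p)
        cents = [tuple(sum(c) / len(g) for c in zip(*g)) if g else cents[i]
                 for i, g in enumerate(groups)]
    return cents, groups

def merge_subclusters(groups, target_k):
    # Stage 2: repeatedly merge the two subclusters whose centroids are
    # closest, until target_k clusters remain
    clusters = [g for g in groups if g]
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):
            ci = tuple(sum(c) / len(clusters[i]) for c in zip(*clusters[i]))
            for j in range(i + 1, len(clusters)):
                cj = tuple(sum(c) / len(clusters[j]) for c in zip(*clusters[j]))
                d = math.dist(ci, cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

The partitional stage keeps the memory footprint small, while the hierarchical stage operates only on the subclusters rather than on all pairwise point proximities.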
In some applications, we may need to represent data efficiently and reduce the data complexity through clustering. A single prototype per cluster, e.g., the centroid or medoid of the cluster in k-means type clustering, may not adequately model clusters of arbitrary shape and size and hence limits the clustering performance on complex data structures. This paper proposes a clustering algorithm that represents a cluster by multiple prototypes. The remainder of this paper is organized as follows. Section 2 reviews the related work along with a discussion of the differences. Section 3 presents the proposed multi-prototype clustering algorithm. In Section 4, the proposed algorithm is tested on both synthetic and real data sets and the results are compared to some existing clustering algorithms. Section 5 gives the conclusions.
Related work
The squared-error clustering algorithm produces a partition of the data set which aims to minimize the squared error [1], [25], [30]. Let $X = \{x_i \mid i = 1, \ldots, N\}$ be a set of $N$ patterns represented as points in $d$-dimensional space and $K$ be the number of clusters. The cluster prototypes are denoted by a set of vectors $C = \{c_k \mid k = 1, \ldots, K\}$. The squared error function is computed as
$$E = \sum_{k=1}^{K} \sum_{i=1}^{N} w_{ik} \, \lVert x_i - c_k \rVert^2$$
subject to $\sum_{k=1}^{K} w_{ik} = 1$, where $w_{ik} \in \{0, 1\}$ and $w_{ik} = 1$ indicates that pattern $x_i$ is assigned to cluster $k$.
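Under the hard-assignment constraint, each pattern contributes exactly one term to the double sum: its squared distance to the prototype it is assigned to. A minimal sketch of the criterion (the function name and the `assign[i] = k` encoding of $w_{ik} = 1$ are illustrative choices):

```python
import math

def squared_error(points, prototypes, assign):
    # E = sum_k sum_i w_ik * ||x_i - c_k||^2 with hard assignments:
    # assign[i] = k encodes w_ik = 1 (and w_ij = 0 for all j != k)
    return sum(math.dist(x, prototypes[k]) ** 2 for x, k in zip(points, assign))
```

k-means type algorithms decrease this value by alternating the two steps that each minimize it with one argument fixed: reassigning patterns to their nearest prototype, and moving each prototype to the centroid of its assigned patterns.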
The proposed clustering algorithm
In this section, we first propose a separation measure to evaluate how well two cluster prototypes are separated. Next, we present the proposed multi-prototype clustering algorithm based on the separation measure. Finally, the complexity analysis of the proposed algorithm is provided.
Experimental results and comparisons
The proposed multi-prototype clustering algorithm can be applied for any numerical data set. We conduct a series of experiments on both synthetic and real data sets to demonstrate the clustering performance of the proposed algorithm. The results are compared to some existing clustering algorithms.
Conclusions
In this paper, we have proposed a multi-prototype clustering algorithm which can discover clusters of arbitrary shape and size. The squared-error clustering is used to produce a number of prototypes and locate the regions of high density because of its low computation and memory space requirements and yet good performance. A separation measure is proposed to evaluate how well two prototypes are separated by a sparse region. Multiple prototypes with small separations are organized to model a cluster of arbitrary shape and size.
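The paper's separation measure itself is not reproduced in this snippet. Purely as an illustration of the underlying idea (not the authors' actual measure), one crude proxy is to compare the data density at the midpoint between two prototypes with the density around the prototypes themselves: a sparse midpoint suggests a separating valley, while a comparable midpoint density suggests the prototypes jointly model one cluster. All names and the radius parameter below are hypothetical:

```python
import math

def local_density(points, center, radius):
    # count of data points within `radius` of `center`
    return sum(1 for p in points if math.dist(p, center) <= radius)

def separation(points, c1, c2, radius=1.0):
    # Hypothetical proxy: 1 means an empty valley between the prototypes
    # (well separated), 0 means the midpoint is as dense as the endpoints
    # (the prototypes likely belong to the same cluster).
    mid = tuple((a + b) / 2 for a, b in zip(c1, c2))
    end = (local_density(points, c1, radius) + local_density(points, c2, radius)) / 2
    if end == 0:
        return 0.0
    return 1.0 - local_density(points, mid, radius) / end
```

Prototype pairs whose separation falls below a threshold would then be linked, and the connected groups of prototypes would jointly represent one cluster.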
References
- et al., On finding the number of clusters, Pattern Recognition Lett. (1999)
- et al., A clustering algorithm using an evolutionary programming-based approach, Pattern Recognition Lett. (1997)
- et al., Cluster center initialization algorithm for k-means clustering, Pattern Recognition Lett. (2004)
- et al., The use of linked line segments for cluster representation and data reduction, Pattern Recognition Lett. (1999)
- et al., A new approach to clustering data with arbitrary shapes, Pattern Recognition (2005)
- et al., Algorithms for Clustering Data (1988)
- et al., Spatial Tessellations, Concepts and Applications of Voronoi Diagrams (2000)
- H. Koivistoinen, M. Ruuska, T. Elomaa, A Voronoi diagram approach to autonomous clustering, Discovery Science, ...
- C. Reyes, M. Adjouadi, A clustering technique for random data classification, IEEE International Conference on Systems, ...
- J. Li, P. Hao, Hierarchical structuring of data on manifolds, IEEE Conference on Computer Vision and Pattern ...
- Data clustering: a review, ACM Comput. Surveys
- Numerical Taxonomy
- Step-wise clustering procedures, J. Am. Statist. Assoc.
- Chameleon: a hierarchical clustering algorithm using dynamic modeling, IEEE Comput.
- A new cluster isolation criterion based on dissimilarity increments, IEEE Trans. Pattern Anal. Mach. Intell.
About the Author—MANHUA LIU received the B.Eng. degree in 1997 and the M.Eng. degree in 2002, both in automatic control, from North China Institute of Technology and Shanghai Jiao Tong University, China, respectively. She received the Ph.D. degree in 2007 from Nanyang Technological University (NTU), Singapore, where she was a research fellow. Currently, she is a lecturer at Shanghai Jiao Tong University, PR China. Her research interests include biometrics, pattern recognition, image processing and machine learning.
About the Author—XUDONG JIANG received the B.Eng. and M.Eng. degrees from the University of Electronic Science and Technology of China in 1983 and 1986, respectively, and the Ph.D. degree from the University of German Federal Armed Forces Hamburg, Germany in 1997, all in electrical and electronic engineering. From 1986 to 1993, he was a Lecturer at the University of Electronic Science and Technology of China, where he received two Science and Technology Awards from the Ministry for Electronic Industry of China. He was a recipient of the German Konrad-Adenauer Foundation young scientist scholarship. From 1993 to 1997, he was with the University of German Federal Armed Forces Hamburg, Germany as a scientific assistant. From 1998 to 2002, he was with the Centre for Signal Processing, Nanyang Technological University, Singapore, as a Research/Senior Fellow, where he developed a fingerprint verification algorithm that ranked first in speed and second in accuracy in the International Fingerprint Verification Competition (FVC2000). From 2002 to 2004, he was a Lead Scientist and Head of the Biometrics Laboratory at the Institute for Infocomm Research, Singapore. Currently he is an Assistant Professor at the School of EEE, Nanyang Technological University, Singapore. His research interests include pattern recognition, image processing, computer vision and biometrics.
About the Author—ALEX CHICHUNG KOT was educated at the University of Rochester, New York, and at the University of Rhode Island, Rhode Island, USA, where he received the Ph.D. degree in electrical engineering in 1989. He was with the AT&T Bell Company, New York, USA. Since 1991, he has been with the Nanyang Technological University (NTU), Singapore, where he is Vice Dean of the School of EEE. His research and teaching interests are in the areas of signal processing for communications, signal processing, watermarking, and information security. Dr. Kot served as the General Co-Chair for the Second International Conference on Information, Communications and Signal Processing (ICICS) in December 1999, and the Advisor for ICICS'01 and ICONIP'02. He received the NTU Best Teacher of the Year Award in 1996 and has served as the Chairman of the IEEE Signal Processing Chapter in Singapore. He was the General Co-Chair for the IEEE ICIP 2004 and served as Associate Editor for the IEEE Transactions on Signal Processing and the IEEE Transactions on Circuits and Systems for Video Technology.