Elsevier

Pattern Recognition

Volume 42, Issue 5, May 2009, Pages 689-698

A multi-prototype clustering algorithm

https://doi.org/10.1016/j.patcog.2008.09.015

Abstract

Clustering is an important unsupervised learning technique widely used to discover the inherent structure of a given data set. Some existing clustering algorithms use a single prototype to represent each cluster, which may not adequately model clusters of arbitrary shape and size and hence limits the clustering performance on complex data structures. This paper proposes a clustering algorithm that represents each cluster by multiple prototypes. The squared-error clustering is used to produce a number of prototypes that locate the regions of high density, because of its low computational cost and yet good performance. A separation measure is proposed to evaluate how well two prototypes are separated. Multiple prototypes with small separations are grouped into a given number of clusters by an agglomerative method. New prototypes are iteratively added to improve poor cluster separations. As a result, the proposed algorithm can discover clusters of complex structure with robustness to initial settings. Experimental results on both synthetic and real data sets demonstrate the effectiveness of the proposed clustering algorithm.

Introduction

Clustering is the unsupervised classification of patterns into groups [1]. It is widely used in data analysis tasks such as data mining, pattern recognition and information retrieval. The Voronoi diagram also provides a means of naturally partitioning space into subregions to facilitate spatial data analysis and has been applied to data clustering [2], [3], [4], [5]. However, this technique emphasizes the shape and arrangement of patterns, i.e., the geometric aspect of groups. Clustering techniques have been widely studied [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19]. These techniques focus more on grouping behavior and can be broadly classified into hierarchical and partitional clustering [1].

Hierarchical clustering is a procedure that transforms the proximity matrix of the data set into a sequence of nested groups in an agglomerative or divisive manner. Agglomerative hierarchical clustering has been widely studied because it allows more feasible merges to be investigated [7], [8], [9], [10], [11]. The single-link [7], complete-link [8] and average-link [7] algorithms produce a sequence of clusterings based on the rank order of proximities. The single-link and complete-link algorithms use the distance between the two closest and the two farthest points of two clusters, respectively, as the cluster distance. Depending on only a few data points to measure the cluster distance makes these algorithms sensitive to noise. The average-link algorithm is more robust to noise because it uses the average distance over all pairs of patterns from different clusters as the cluster distance. The CURE algorithm [9] represents each cluster with a fixed number of well-scattered points and shrinks these points toward the cluster center by a specified fraction; it improves noise robustness over the single-link algorithm. The Chameleon algorithm [10] partitions a constructed k-nearest-neighbor graph into a number of subclusters and then dynamically merges them. In general, hierarchical clustering algorithms provide an easy understanding of the inherent structure of the data set, but their high computational cost and large memory requirements make them inefficient for large data sets.
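The noise sensitivity of the single-link and complete-link distances can be seen in a small illustration (a Python sketch, not taken from any of the cited papers; the toy data are hypothetical):

```python
# Single-link vs. complete-link cluster distances on a toy 1-D data set.
# A single bridging "noise" point dominates the single-link distance.

def single_link(a, b):
    """Distance between the two closest points of clusters a and b."""
    return min(abs(x - y) for x in a for y in b)

def complete_link(a, b):
    """Distance between the two farthest points of clusters a and b."""
    return max(abs(x - y) for x in a for y in b)

c1 = [0.0, 1.0, 2.0]
c2 = [2.5, 9.0, 10.0]  # 2.5 acts as a bridging noise point

print(single_link(c1, c2))    # → 0.5: one noisy point makes the clusters look close
print(complete_link(c1, c2))  # → 10.0: dominated by the two extreme points
```

The average-link distance, averaging over all nine point pairs here, lies between these extremes and is therefore less affected by the single bridging point.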

Partitional clustering produces a single partition of the data set that aims to optimize a certain cluster criterion function. Many partitional clustering algorithms have been proposed based on different cluster criteria [20], [21], [22], [23], [24], [25], [26]. In fact, each cluster criterion imposes a certain structure on the data set. Model-based clustering algorithms assume that the data distribution of a cluster fits a given probability density model, such as a Gaussian model [20], [21], and can discover hyper-ellipsoidal clusters. However, the assumption of a static model prevents them from adequately capturing the characteristics of individual clusters, especially when the data set contains clusters of diverse shapes and densities. Nonparametric clustering algorithms based on density and grids identify clusters by searching for regions of high data density separated by sparse valleys [22], [23], [24]. Although these algorithms can find clusters of arbitrary shape, their performance usually degrades on high-dimensional data sets. The squared-error clustering algorithm is based on the squared error criterion [1], [25], [30]. It tends to work well with compact clusters of hyper-spherical shape and similar size, and it is widely studied and used [25], [26], [27], [28], [29], [30], [31]. New distance measures have been proposed to detect clusters with specific characteristics [28], [29]. Besides the squared error, other criteria such as the Davies–Bouldin index [32] and the cluster variance are imposed as global criteria to determine the optimum number of clusters [26], [31]. Most partitional clustering algorithms require less memory and computation than hierarchical clustering algorithms, but their clustering results are usually not as good.
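A minimal Lloyd-style sketch of squared-error (K-means type) clustering, in plain Python for illustration only (the point set and parameters are hypothetical, not from the paper's experiments):

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20, seed=0):
    """Alternate assignment and centroid-update steps to reduce the squared error."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each pattern goes to its nearest prototype.
        labels = [min(range(k), key=lambda l: dist2(p, centers[l])) for p in points]
        # Update step: each prototype moves to the centroid of its members.
        for l in range(k):
            members = [p for p, lab in zip(points, labels) if lab == l]
            if members:
                centers[l] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

# Two compact, well-separated toy clusters.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(pts, 2)
```

On this compact, well-separated toy set the two groups are recovered regardless of which two points are sampled as initial prototypes; on clusters of irregular shape or very different sizes, a single prototype per cluster fails, which is the limitation the multi-prototype representation in this paper addresses.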
Recently, support vector clustering has been proposed to generate cluster boundaries of arbitrary shape by transforming the original space into a high-dimensional space with a kernel function [19]. Although this algorithm can solve some difficult clustering problems, it is not easy to choose a suitable kernel parameter, and the clustering result provides no information about the representation of the clusters.

Hybrid clustering algorithms combine the merits of partitional and hierarchical clustering for better data grouping [12], [13], [14], [15], [33], [34]. They usually partition the data set into a relatively large number of small subclusters and construct a hierarchical structure over them based on a certain cluster distance (similarity) measure; a given number of clusters can then be found in this hierarchy. The BIRCH algorithm [12] arranges the data set into a number of subclusters represented by cluster feature (CF) vectors in a tree structure and is efficient for large data sets.

In some applications, we may need to efficiently represent data and reduce data complexity through clustering. A single prototype per cluster, e.g., the centroid or medoid in K-means-type clustering, may not adequately model clusters of arbitrary shape and size and hence limits clustering performance on complex data structures. This paper proposes a clustering algorithm that represents a cluster by multiple prototypes. The remainder of this paper is organized as follows. Section 2 reviews the related work along with a discussion of its differences from our approach. Section 3 presents the proposed multi-prototype clustering algorithm. In Section 4, the proposed algorithm is tested on both synthetic and real data sets, and the results are compared to those of some existing clustering algorithms. Section 5 gives the conclusions.

Section snippets

Related work

The squared-error clustering algorithm produces a partition of the data set which aims to minimize the squared error [1], [25], [30]. Let $X=\{X_1,X_2,\ldots,X_N\}$, where $X_i=[x_{i,1},x_{i,2},\ldots,x_{i,M}]\in\mathbb{R}^M$, be a set of $N$ patterns represented as points in $M$-dimensional space, and let $K$ be the number of clusters. The cluster prototypes are denoted by a set of vectors $Z=\{Z_1,Z_2,\ldots,Z_K\}$. The squared error function is computed as
$$E(U,Z)=\sum_{l=1}^{K}\sum_{i=1}^{N}u_{i,l}\,d^2(X_i,Z_l),$$
subject to
$$\sum_{l=1}^{K}u_{i,l}=1,\quad 1\le i\le N,$$
where $u_{i,l}\in\{0,1\}$ and $u_{i,l}=1$ indicates that pattern $X_i$ is assigned to cluster $l$.
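The criterion above can be transcribed directly once the hard assignments $u_{i,l}$ are encoded as one label per pattern (an illustrative Python sketch; the example data are hypothetical):

```python
def squared_error(points, prototypes, labels):
    """E(U, Z): sum of squared distances from each pattern X_i to its prototype Z_l."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, prototypes[l]))
        for p, l in zip(points, labels)
    )

X = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0)]
Z = [(0.0, 1.0), (4.0, 0.0)]
U = [0, 0, 1]  # each pattern assigned to exactly one prototype
print(squared_error(X, Z, U))  # → 2.0  (1.0 + 1.0 + 0.0)
```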

The proposed clustering algorithm

In this section, we first propose a separation measure to evaluate how well two cluster prototypes are separated. Next, we present the proposed multi-prototype clustering algorithm based on the separation measure. Finally, the complexity analysis of the proposed algorithm is provided.
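Since this snippet does not reproduce the separation measure itself, the overall flow can only be sketched with a stand-in: below, plain squared Euclidean distance between prototypes substitutes for the paper's separation measure, and prototype groups are merged agglomeratively until the given number of clusters remains (a hypothetical illustration, not the authors' implementation):

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def merge_prototypes(prototypes, K):
    """Agglomeratively merge the least-separated prototype groups until K remain."""
    groups = [[i] for i in range(len(prototypes))]

    def sep(g1, g2):
        # Stand-in separation: squared distance between the closest prototypes
        # of the two groups (the paper uses its own separation measure instead).
        return min(dist2(prototypes[i], prototypes[j]) for i in g1 for j in g2)

    while len(groups) > K:
        a, b = min(
            ((a, b) for a in range(len(groups)) for b in range(a + 1, len(groups))),
            key=lambda ab: sep(groups[ab[0]], groups[ab[1]]),
        )
        groups[a] += groups.pop(b)  # merge the least-separated pair of groups
    return groups

protos = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
print(merge_prototypes(protos, 2))  # → [[0, 1], [2, 3]]
```

The two nearby prototypes on each side are grouped together, so each final cluster is modeled by multiple prototypes rather than a single centroid.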

Experimental results and comparisons

The proposed multi-prototype clustering algorithm can be applied to any numerical data set. We conduct a series of experiments on both synthetic and real data sets to demonstrate the clustering performance of the proposed algorithm. The results are compared to those of some existing clustering algorithms.

Conclusions

In this paper, we have proposed a multi-prototype clustering algorithm which can discover clusters of arbitrary shape and size. The squared-error clustering is used to produce a number of prototypes and locate the regions of high density because of its low computation and memory requirements and yet good performance. A separation measure is proposed to evaluate how well two prototypes are separated by a sparse region. Multiple prototypes with small separations are organized to model a given number of clusters.


References (35)

  • A.K. Jain et al., Data clustering: a review, ACM Comput. Surveys (1999).
  • P.H.A. Sneath et al., Numerical Taxonomy (1973).
  • B. King, Step-wise clustering procedures, J. Am. Statist. Assoc. (1967).
  • S. Guha, R. Rastogi, K. Shim, Cure: an efficient clustering algorithm for large databases, Proceedings of the...
  • G. Karypis et al., Chameleon: a hierarchical clustering algorithm using dynamic modeling, IEEE Comput. (1999).
  • A.L. Fred et al., A new cluster isolation criterion based on dissimilarity increments, IEEE Trans. Pattern Anal. Mach. Intell. (2003).
  • T. Zhang, R. Ramakrishnan, M. Livny, Birch: an efficient data clustering method for very large databases, Proceedings...

    About the Author: MANHUA LIU received the B.Eng. degree in 1997 and the M.Eng. degree in 2002 in automatic control from North China Institute of Technology and Shanghai Jiao Tong University, China, respectively. She received her Ph.D. degree in 2007 from Nanyang Technological University (NTU), Singapore, where she was a research fellow. Currently, she is a lecturer at Shanghai Jiao Tong University, PR China. Her research interests include biometrics, pattern recognition, image processing and machine learning.

    About the Author: XUDONG JIANG received the B.Eng. and M.Eng. degrees from the University of Electronic Science and Technology of China in 1983 and 1986, respectively, and the Ph.D. degree from the University of German Federal Armed Forces Hamburg, Germany, in 1997, all in electrical and electronic engineering. From 1986 to 1993, he was a Lecturer at the University of Electronic Science and Technology of China, where he received two Science and Technology Awards from the Ministry for Electronic Industry of China. He was a recipient of the German Konrad-Adenauer Foundation young scientist scholarship. From 1993 to 1997, he was with the University of German Federal Armed Forces Hamburg, Germany, as a scientific assistant. From 1998 to 2002, he was with the Centre for Signal Processing, Nanyang Technological University, Singapore, as a Research/Senior Fellow, where he developed a fingerprint verification algorithm that ranked first in speed and second in accuracy in the International Fingerprint Verification Competition (FVC2000). From 2002 to 2004, he was a Lead Scientist and Head of the Biometrics Laboratory at the Institute for Infocomm Research, Singapore. Currently he is an Assistant Professor at the School of EEE, Nanyang Technological University, Singapore. His research interests include pattern recognition, image processing, computer vision and biometrics.

    About the Author: ALEX CHICHUNG KOT was educated at the University of Rochester, New York, and at the University of Rhode Island, Rhode Island, USA, where he received the Ph.D. degree in electrical engineering in 1989. He was with the AT&T Bell Company, New York, USA. Since 1991, he has been with the Nanyang Technological University (NTU), Singapore, where he is Vice Dean of the School of EEE. His research and teaching interests are in the areas of signal processing for communications, signal processing, watermarking, and information security. Dr. Kot served as the General Co-Chair for the Second International Conference on Information, Communications and Signal Processing (ICICS) in December 1999 and the Advisor for ICICS'01 and ICONIP'02. He received the NTU Best Teacher of the Year Award in 1996 and has served as the Chairman of the IEEE Signal Processing Chapter in Singapore. He was the General Co-Chair for IEEE ICIP 2004 and served as Associate Editor for the IEEE Transactions on Signal Processing and the IEEE Transactions on Circuits and Systems for Video Technology.
