A top-down approach for density-based clustering using multidimensional indexes

doi:10.1016/j.jss.2003.08.237

Journal of Systems and Software

Volume 73, Issue 1, September 2004, Pages 169-180

https://doi.org/10.1016/j.jss.2003.08.237

Abstract

Clustering on large databases has been studied actively as an increasing number of applications involve huge amounts of data. In this paper, we propose an efficient top-down approach for density-based clustering, which is based on the density information stored in the index nodes of a multidimensional index. We first provide a formal definition of the cluster based on the concept of the region contrast partition. Based on this notion, we propose a novel top-down clustering algorithm, which improves efficiency through branch-and-bound pruning. For this pruning, we present a technique for determining the bounds based on sparse and dense internal regions and formally prove the correctness of the bounds. Experimental results show that the proposed method reduces the elapsed time by a factor of up to 96 compared with BIRCH, a well-known clustering method. The results also show that the performance improvement becomes more marked as the size of the database increases.

Introduction

Data mining has become a research area of increasing importance. In particular, clustering on a large database has become one of the most actively studied topics of data mining (Chen et al., 1996). Clustering, also known as unsupervised learning, distinguishes dense areas with high data concentration from sparse areas to find useful patterns of data distribution in the database (Ankerst et al., 1999; Ester et al., 1996; Karypis et al., 1999; Kaufman and Rousseeuw, 1990; Ng and Han, 1994; Schikuta, 1996). The purpose of clustering is to group the objects of a database into meaningful subclasses, called clusters (Ester et al., 1996). Clustering is widely used in applications such as customer purchase pattern analysis, medical data analysis, geographical information analysis, and image analysis. An example is seismic fault detection in a geographic information system, given data on earthquakes in seismic regions (see <http://www.ceri.memphis.edu> for examples). Here, clustering lets us locate the faults by partitioning the data into two regions, one with frequent earthquakes and the other without. The focus of clustering methods has been on the accuracy of clusters and the computation time. As databases become larger, however, most clustering methods are no longer practical because of excessive processing time. Recent clustering techniques therefore focus on scalability (Breunig et al., 2001; Ganti et al., 1999).

In this paper, we propose a novel top-down clustering approach that avoids such excessive computations by searching an index for densely populated regions in a database. In particular, this method takes advantage of a multidimensional index commonly used in large database applications (such as data warehouses and geographical information systems). In a multidimensional index, objects that are closer to each other have a higher probability of being stored in the same or adjacent data pages. This is called the clustering property (Ester et al., 1995; Lee et al., 1997). By taking advantage of this property, we identify neighboring objects using only density information, without accessing the objects themselves or performing many distance calculations. We further improve the efficiency by reducing the number of index nodes accessed, using a pruning mechanism in a top-down search of the index.
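To make the idea of density information concrete, the following is a minimal sketch of how a density could be derived from an index entry alone. It assumes that an entry exposes an axis-aligned bounding region and an object count, and that density means objects per unit volume; the paper's formal definition is given in its Section 3.2, which is not reproduced here, so the `Region` class and its fields are hypothetical.

```python
from dataclasses import dataclass
from math import prod
from typing import Sequence


@dataclass(frozen=True)
class Region:
    """Axis-aligned region of an index entry (hypothetical structure)."""
    low: Sequence[float]    # lower corner, one value per dimension
    high: Sequence[float]   # upper corner, one value per dimension
    count: int              # number of objects stored under this entry

    def volume(self) -> float:
        # Product of the side lengths across all dimensions.
        return prod(h - l for l, h in zip(self.low, self.high))

    def density(self) -> float:
        # Assumed definition: objects per unit volume.
        v = self.volume()
        if v == 0:
            # Degenerate region: empty is sparse, non-empty is unbounded.
            return float("inf") if self.count else 0.0
        return self.count / v


# Example: 40 objects in a 2 x 5 region give a density of 4 objects per unit area.
region = Region(low=[0.0, 0.0], high=[2.0, 5.0], count=40)
print(region.density())
```

Note that nothing in this computation touches the stored objects or computes a distance between them, which is the property the approach relies on.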

Specifically, we first provide a formal definition of a cluster based on the notion of the density of regions (which we formally define in Section 3.2) in the multidimensional index. For this definition, we introduce the concept of the region contrast partition, which divides the database space into the higher-density section and the lower-density section based on the density of regions. Then, we present a branch-and-bound algorithm for pruning the index search to do the region contrast partition efficiently. Given two bounds (high and low), the pruning eliminates the index nodes whose region densities are out of the bounds. For this algorithm, we describe how the bounds are calculated and formally prove their correctness.
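The pruning idea can be illustrated with a small sketch. This is not the paper's algorithm: it assumes that a node whose density is at or above the high bound can be taken as dense, and one at or below the low bound as sparse, without visiting its subtree, and that only nodes between the bounds are expanded. The `Node` class, the `partition` function, and the midpoint rule for undecided leaves are all hypothetical, and the bounds are passed in rather than computed as in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical index node: a region density plus child entries."""
    name: str
    density: float
    children: List["Node"] = field(default_factory=list)


def partition(root: Node, low: float, high: float) -> Tuple[List[str], List[str], int]:
    """Split regions into dense and sparse sets with top-down pruning.

    Returns the dense region names, the sparse region names, and the
    number of nodes visited, so the effect of pruning can be observed.
    """
    dense: List[str] = []
    sparse: List[str] = []
    visited = 0
    stack = [root]
    while stack:
        node = stack.pop()
        visited += 1
        if node.density >= high:
            dense.append(node.name)        # pruned: subtree is not visited
        elif node.density <= low:
            sparse.append(node.name)       # pruned: subtree is not visited
        elif node.children:
            stack.extend(node.children)    # undecided: descend one level
        else:
            # Undecided leaf: assign by the nearer bound (an assumption).
            midpoint = (low + high) / 2
            (dense if node.density >= midpoint else sparse).append(node.name)
    return dense, sparse, visited


tree = Node("root", 5, [
    Node("A", 12, [Node("A1", 15), Node("A2", 9)]),
    Node("B", 1, [Node("B1", 1)]),
    Node("C", 6, [Node("C1", 9), Node("C2", 2)]),
])
dense, sparse, visited = partition(tree, low=2, high=10)
print(sorted(dense), sorted(sparse), visited)
```

In this toy tree of nine nodes, the subtrees under A and B are never opened, so only six nodes are visited. The saving grows with the size of the pruned subtrees.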

We demonstrate empirically that the proposed method is more efficient than BIRCH (Zhang et al., 1996), a well-known clustering algorithm, while producing clusters with the same or better accuracy. For this experiment, we use the elapsed time as the efficiency metric and introduce a new accuracy metric based on the relative number of objects in a cluster. The experimental results show that our algorithm is one or two orders of magnitude more efficient if we consider the index as already available from other applications. Even if we take the index creation and maintenance cost into consideration, our algorithm is significantly more efficient when the creation cost is amortized over the clustering operations performed before the index needs to be recreated, if it needs to be recreated at all.
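The amortization argument can be made explicit with a little arithmetic. The sketch below uses hypothetical function names and made-up cost figures, not measurements from the paper. It assumes a one-time index creation cost shared equally across all clustering runs that reuse the index.

```python
import math


def amortized_cost(index_build: float, per_run: float, runs: int) -> float:
    """Average cost per clustering run when one index serves `runs` runs."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return index_build / runs + per_run


def break_even_runs(index_build: float, per_run: float, baseline: float) -> int:
    """Smallest number of runs at which the amortized cost beats the baseline."""
    if per_run >= baseline:
        raise ValueError("no break-even: a single run already costs as much as the baseline")
    # index_build / n + per_run < baseline  =>  n > index_build / (baseline - per_run)
    return math.floor(index_build / (baseline - per_run)) + 1


# Hypothetical costs in arbitrary time units.
print(amortized_cost(index_build=100, per_run=1, runs=10))
print(break_even_runs(index_build=100, per_run=1, baseline=21))
```

With these illustrative figures, the index-based method becomes cheaper per run than the baseline from the sixth run onward.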

The rest of this paper is organized as follows. Section 2 introduces related work on existing clustering algorithms for large databases. Section 3 provides a formal definition of the cluster based on density represented by the multidimensional index. Section 4 presents the proposed top-down clustering algorithm. Section 5 shows the experimental results comparing the proposed algorithm and BIRCH. Finally, Section 6 summarizes and concludes the paper.

Section snippets

Related work

For an efficient clustering of large databases, some methods use sampling techniques (Breunig et al., 2001; Guha et al., 1998; Palmer and Faloutsos, 2000), and others use cluster summary information (Ganti et al., 1999; Zhang et al., 1996). The former methods extract samples from large databases and perform clustering on the samples. These methods have the advantage of being simple and easy to apply. However, they have the disadvantage that the accuracy of the clusters found depends largely on

Definition of the cluster

In this section we provide some definitions to formally define the cluster based on density represented by the multidimensional index. In Section 3.1, we introduce the terminology related to multidimensional indexes. In Section 3.2, we present the notion of the region contrast partition, which provides the basis of the proposed clustering method, and define a cluster based on the partition.

A top-down approach for density-based clustering

In this section we propose a top-down approach for density-based clustering that uses a multidimensional index. In Section 4.1, we introduce the concept of density-based pruning using internal entry information. In Section 4.2, we present an efficient top-down clustering algorithm using a branch-and-bound pruning mechanism.

Performance evaluation

In this section we present the results of comparing our proposed algorithm, density_pruning_clustering, with BIRCH (Zhang et al., 1996), a widely known clustering algorithm. First, we describe the experimental data and environment in Section 5.1. Then, we evaluate the efficiency of obtaining clusters in Section 5.2, the accuracy of clusters in Section 5.3, and the sensitivity of cluster accuracy to the clustering factor in Section 5.4.

Conclusions

In this paper, we have proposed a novel top-down clustering method based on region density using a multidimensional index. Generally, multidimensional indexes have the inherent clustering property of storing similar (i.e., close to each other) objects in the same or adjacent data pages. By taking advantage of this property, our method finds similar objects using only the region density information without incurring the high cost of accessing the objects themselves and calculating distances among

Acknowledgements

This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Advanced Information Technology Research Center (AITrc).

References (20)

Ankerst, M., Breunig, M., Kriegel, H.P., Sander, J., 1999. OPTICS: Ordering points to identify the clustering...
Breunig, M., Kriegel, H.P., Kroger, P., Sander, J., 2001. Data bubbles: quality preserving performance boosting for...
Chen, M.S., et al., 1996. Data mining: an overview from a database perspective. IEEE Trans. Knowledge Data Eng.
Ester, M., Kriegel, H.P., Xu, X., 1995. Knowledge discovery in large spatial databases: focusing techniques for...
Ester, M., Kriegel, H.P., Sander, J., Xu, X., 1996. A density-based algorithm for discovering clusters in large spatial...
Ganti, V., Ramakrishnan, R., Gehrke, J., Powell, A., French, J., 1999. Clustering large datasets in arbitrary metric...
Guha, S., Rastogi, R., Shim, K.S., 1998. CURE: an efficient clustering algorithm for large databases. In: Proc. Int'l...
Hwang, J.J., Whang, K.Y., Moon, Y.S., Lee, B.S., 2003. Top-down clustering using multidimensional indexes, KAIST...
Karypis, G., et al., 1999. Chameleon: hierarchical clustering using dynamic modeling. IEEE Comp.
Kaufman, L., et al., 1990. Finding Groups in Data: An Introduction to Cluster Analysis.


