A top-down approach for density-based clustering using multidimensional indexes
Introduction
Data mining has become a research area of increasing importance. In particular, clustering on large databases is one of the most actively studied topics in data mining (Chen et al., 1996). Clustering, also known as unsupervised learning, distinguishes dense areas with high data concentration from sparse areas to find useful patterns of data distribution in the database (Ankerst et al., 1999; Ester et al., 1996; Karypis et al., 1999; Kaufman and Rousseeuw, 1990; Ng and Han, 1994; Schikuta, 1996). The purpose of clustering is to group the objects of a database into meaningful subclasses, called clusters (Ester et al., 1996). Clustering is widely used in applications such as customer purchase pattern analysis, medical data analysis, geographical information analysis, and image analysis. An example is seismic fault detection in a geographic information system, given data on earthquakes in seismic regions (see <http://www.ceri.memphis.edu> for examples). Here, clustering lets us locate the faults by partitioning the data into two regions, one with frequent earthquakes and the other without. The focus of clustering methods has been on the accuracy of clusters and the computation time. As databases become larger, however, most clustering methods are no longer practical because of excessive processing time. Therefore, recent clustering techniques focus on scalability (Breunig et al., 2001; Ganti et al., 1999).
In this paper, we propose a novel top-down clustering approach that avoids such excessive computation by searching an index for densely populated regions in a database. In particular, this method takes advantage of a multidimensional index commonly used in large database applications (such as data warehouses and geographical information systems). In a multidimensional index, objects that are closer to each other have a higher probability of being stored in the same or adjacent data pages. This is called the clustering property (Ester et al., 1995; Lee et al., 1997). By taking advantage of this property, we identify neighboring objects using only density information, without accessing the objects themselves or performing many distance calculations. We further improve efficiency by using a pruning mechanism in a top-down search of the index, which reduces the number of index nodes accessed.
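To make the idea of working from density information alone more concrete, the following is a minimal sketch. It assumes, purely for illustration, that each index entry exposes an axis-aligned bounding box and the number of objects stored beneath it, and that density is the object count divided by the box volume. The paper's formal definition of region density is given in its Section 3.2 and may differ; `Region`, `density`, and `dense_regions` are hypothetical names.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Region:
    """Hypothetical index entry: a bounding box plus an object count."""
    low: tuple[float, ...]   # lower corner of the box
    high: tuple[float, ...]  # upper corner of the box
    count: int               # number of objects stored under this entry

    def volume(self) -> float:
        # Product of the side lengths of the axis-aligned box.
        return prod(h - l for l, h in zip(self.low, self.high))

    def density(self) -> float:
        # Assumed definition: objects per unit volume.
        v = self.volume()
        if v > 0:
            return self.count / v
        # Degenerate box: treat as infinitely dense only if it holds objects.
        return float("inf") if self.count > 0 else 0.0


def dense_regions(regions, threshold):
    """Keep the regions whose density reaches the threshold.

    Only counts and box corners are read; no stored object is fetched
    and no pairwise distance is computed.
    """
    return [r for r in regions if r.density() >= threshold]
```

In this sketch, two regions with similar counts can have very different densities if their boxes differ in size, which is why the count alone is not enough to separate dense areas from sparse ones.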
Specifically, we first provide a formal definition of a cluster based on the notion of the density of regions (which we formally define in Section 3.2) in the multidimensional index. For this definition, we introduce the concept of the region contrast partition, which divides the database space into a higher-density section and a lower-density section based on the density of regions. Then, we present a branch-and-bound algorithm that prunes the index search so that the region contrast partition can be computed efficiently. Given two bounds (high and low), the pruning eliminates the index nodes whose region densities fall outside the bounds. For this algorithm, we describe how the bounds are calculated and formally prove their correctness.
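The pruning idea can be sketched as a top-down traversal. This is not the paper's algorithm: the bounds here are plain parameters (the paper derives them and proves their correctness), density is assumed to be object count divided by box volume, and a pruned node is assumed to belong wholly to the higher-density or lower-density section. `Node`, `contrast_partition`, and `cutoff` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Node:
    """Hypothetical index node: bounding box, object count, child nodes."""
    low: tuple
    high: tuple
    count: int
    children: list = field(default_factory=list)

    def density(self) -> float:
        # Assumed definition: objects per unit volume of the bounding box.
        volume = prod(h - l for l, h in zip(self.low, self.high))
        return self.count / volume if volume > 0 else float("inf")


def contrast_partition(root, low_bound, high_bound, cutoff):
    """Split the space under `root` into dense and sparse regions.

    A node whose density lies outside [low_bound, high_bound] is pruned:
    it is assigned as a whole and its subtree is never visited.
    A node inside the bounds is expanded; a leaf inside the bounds is
    assigned by comparing its density with `cutoff`.
    Returns (dense, sparse, visited_node_count).
    """
    dense, sparse, visited = [], [], 0
    stack = [root]
    while stack:
        node = stack.pop()
        visited += 1
        d = node.density()
        if d > high_bound:
            dense.append(node)            # pruned: clearly dense
        elif d < low_bound:
            sparse.append(node)           # pruned: clearly sparse
        elif node.children:
            stack.extend(node.children)   # undecided: descend
        else:
            (dense if d >= cutoff else sparse).append(node)
    return dense, sparse, visited
```

The benefit of the bounds shows up in the visited-node count: every pruned internal node saves the cost of reading its entire subtree.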
We demonstrate empirically that the proposed method is more efficient than BIRCH (Zhang et al., 1996), a well-known clustering algorithm, while producing clusters with the same or better accuracy. For this experiment, we use the elapsed time as the efficiency metric and introduce a new accuracy metric based on the relative number of objects in a cluster. The experimental results show that our algorithm is one or two orders of magnitude more efficient if we consider the index as already available from other applications. Even if we take the index creation and maintenance cost into consideration, our algorithm is significantly more efficient when the creation cost is amortized over the clustering operations performed before the index needs to be recreated, if it ever does.
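The amortization argument is simple arithmetic and can be illustrated as follows. This is only a sketch with hypothetical names and made-up cost units, not figures from the experiments: the index creation cost is spread evenly over the clustering runs that reuse the index and added to the per-run cost.

```python
import math


def amortized_cost(index_build_cost, per_run_cost, runs):
    """Average cost of one clustering run when the index is reused `runs` times."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return per_run_cost + index_build_cost / runs


def break_even_runs(index_build_cost, per_run_cost, baseline_cost):
    """Smallest number of runs at which the amortized cost drops strictly
    below the cost of a baseline method that needs no index.

    Returns None if the index-based run is never cheaper than the baseline.
    """
    if per_run_cost >= baseline_cost:
        return None
    ratio = index_build_cost / (baseline_cost - per_run_cost)
    return math.floor(ratio) + 1
```

For example, with an assumed build cost of 100 units, a per-run cost of 1 unit, and a baseline cost of 21 units, the index-based approach becomes cheaper on average from the sixth run onward.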
The rest of this paper is organized as follows. Section 2 introduces related work on existing clustering algorithms for large databases. Section 3 provides a formal definition of the cluster based on density represented by the multidimensional index. Section 4 presents the proposed top-down clustering algorithm. Section 5 shows the experimental results comparing the proposed algorithm and BIRCH. Finally, Section 6 summarizes and concludes the paper.
Related work
For an efficient clustering of large databases, some methods use sampling techniques (Breunig et al., 2001; Guha et al., 1998; Palmer and Faloutsos, 2000), and others use cluster summary information (Ganti et al., 1999; Zhang et al., 1996). The former methods extract samples from large databases and perform clustering on the samples. These methods have the advantage of being simple and easy to apply. However, they have the disadvantage that the accuracy of the clusters found depends largely on
Definition of the cluster
In this section we provide some definitions to formally define the cluster based on density represented by the multidimensional index. In Section 3.1, we introduce the terminology related to multidimensional indexes. In Section 3.2, we present the notion of the region contrast partition, which provides the basis of the proposed clustering method, and define a cluster based on the partition.
A top-down approach for density-based clustering
In this section we propose a top-down approach for density-based clustering that uses a multidimensional index. In Section 4.1, we introduce the concept of density-based pruning using internal entry information. In Section 4.2, we present an efficient top-down clustering algorithm using a branch-and-bound pruning mechanism.
Performance evaluation
In this section we present the results of comparing our proposed algorithm, density_pruning_clustering, with BIRCH (Zhang et al., 1996), a widely known clustering algorithm. First, we describe the experimental data and environment in Section 5.1. Then, we evaluate the efficiency of obtaining clusters in Section 5.2, the accuracy of clusters in Section 5.3, and the sensitivity of cluster accuracy to the clustering factor in Section 5.4.
Conclusions
In this paper, we have proposed a novel top-down clustering method based on region density using a multidimensional index. Generally, multidimensional indexes have the inherent clustering property of storing similar (i.e., close to each other) objects in the same or adjacent data pages. By taking advantage of this property, our method finds similar objects using only the region density information without incurring the high cost of accessing the objects themselves and calculating distances among
Acknowledgements
This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Advanced Information Technology Research Center (AITrc).
References (20)
- Ankerst, M., Breunig, M., Kriegel, H.P., Sander, J., 1999. OPTICS: Ordering points to identify the clustering...
- Breunig, M., Kriegel, H.P., Kroger, P., Sander, J., 2001. Data bubbles: quality preserving performance boosting for...
- Chen et al., 1996. Data mining: an overview from a database perspective. IEEE Trans. Knowledge Data Eng.
- Ester, M., Kriegel, H.P., Xu, X., 1995. Knowledge discovery in large spatial databases: focusing techniques for...
- Ester, M., Kriegel, H.P., Sander, J., Xu, X., 1996. A density-based algorithm for discovering clusters in large spatial...
- Ganti, V., Ramakrishnan, R., Gehrke, J., Powell, A., French, J., 1999. Clustering large datasets in arbitrary metric...
- Guha, S., Rastogi, R., Shim, K.S., 1998. CURE: an efficient clustering algorithm for large databases. In: Proc. Int'l...
- Hwang, J.J., Whang, K.Y., Moon, Y.S., Lee, B.S., 2003. Top-down clustering using multidimensional indexes, KAIST...
- Karypis et al., 1999. Chameleon: hierarchical clustering using dynamic modeling. IEEE Comp.
- Kaufman and Rousseeuw, 1990. Finding Groups in Data: An Introduction to Cluster Analysis.