FINDIT: a fast and intelligent subspace clustering algorithm using dimension voting

https://doi.org/10.1016/j.infsof.2003.07.003

Abstract

The aim of this paper is to present a novel subspace clustering method named FINDIT. Clustering is the process of finding interesting patterns residing in a dataset by grouping similar data objects together and separating them from dissimilar ones based on their dimensional values. Subspace clustering is a new area of clustering which achieves the clustering goal in high dimensions by allowing clusters to be formed with their own correlated dimensions.

In subspace clustering, selecting the correct dimensions is very important because the distance between points changes easily according to the selected dimensions. However, selecting dimensions correctly is difficult, because data grouping and dimension selection must be performed simultaneously. FINDIT determines the correlated dimensions for each cluster based on two key ideas: a dimension-oriented distance measure which fully utilizes dimensional difference information, and a dimension voting policy which determines important dimensions in a probabilistic way based on V nearest neighbors' information. Through various experiments on synthetic data, FINDIT is shown to be very successful on the high dimensional clustering problem. FINDIT satisfies most requirements for a good clustering method, such as accuracy of results, robustness to noise and cluster density, and scalability to the dataset size and the dimensionality. Moreover, it scales gracefully to full dimensionality without any modification to the algorithm.

Introduction

Data Mining is the process of extracting unknown and potentially useful information from databases. It can be used in many application areas such as insurance, health care, database marketing, stock management, and scientific knowledge discovery. Clustering is one of the frequently used tools in Data Mining; it refers to the process of partitioning data so that intra-group similarities are maximized and inter-group similarities are minimized at the same time. It is especially useful when there is little knowledge about the given dataset. Accordingly, many data clustering techniques have been proposed [3], [4], [5], [6], [7], [8], [9], [12], [15], [16], [22]. However, conventional clustering methods do not scale well to high dimensions in terms of effectiveness and efficiency. Two problems prevent them from performing well on high dimensional datasets. First, it is difficult to distinguish similar points from dissimilar ones, since the distance between any two points becomes almost the same value [17]. Second, clusters tend to exist in different subspaces [10].

One possible extension of conventional clustering is the application of dimension reduction techniques. These approaches lower the dimensionality first, either by removing less important dimensions or by transforming the original space into a low dimensional one. Conventional clustering techniques are then applied to the dataset in the reduced dimensions. However, because clusters can be formed in different subspaces, such dimension reduction can discard dimensional information that is useful to some clusters. As a result of the lost information, it can generate clusters that do not fully reflect the original clusters' properties. Moreover, the generated result is usually not suitable for further analysis, which many data mining applications require.

Subspace clustering is the answer to this challenge. It achieves the clustering goal by allowing clusters to be formed with their own correlated dimensions. Since it was first proposed in Ref. [10], several different subspace clustering methods have been presented with some success [13], [14], [18], [19], [20]. Recent subspace clustering algorithms can be broadly classified into two categories: the grid-based approach [10], [14], [18] and the partitioning approach [13], [19]. The grid-based approaches mainly focus on detailed space segmentation to report dense regions, and the partitioning approaches focus on disjoint cluster generation.

CLIQUE [10], the first subspace clustering algorithm, partitions the whole data space into non-overlapping rectangular units and then searches for dense units based on the assumption: 'If a k-dimensional region is dense, the (k−1)-dimensional regions containing it should also be dense'. After the dense units are found, several sets of connected dense units are reported as clusters. CLIQUE performs a heuristic pruning step to reduce the number of candidate subspaces, which increases exponentially with the dimensionality; however, this introduces a tradeoff between accuracy and time. ENCLUS [14] follows an approach similar to CLIQUE. It states three requirements for good clusters and presents an entropy measure that satisfies all three. In the clustering process, the entropy measure is used to prune away uninteresting subspaces efficiently. MAFIA [18] is another extension of CLIQUE. It proposes a so-called adaptive grid to enhance the effectiveness and efficiency of CLIQUE and adopts parallel processing to deal with large datasets. These three subspace clustering methods are effective at finding arbitrarily shaped clusters; however, poor scalability to high dimensionality is a common problem of grid-based approaches.
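
The monotonicity assumption enables an Apriori-style bottom-up search. The sketch below is our own illustration of this pruning idea, not CLIQUE's published code: it generates k-dimensional candidate subspaces from the dense (k−1)-dimensional ones, keeping a candidate only if all of its (k−1)-dimensional projections are dense. Subspaces are represented as sorted tuples of dimension indices.

```python
from itertools import combinations

def candidate_subspaces(dense_kminus1):
    """Apriori-style join-and-prune (illustrative, not CLIQUE's code).

    dense_kminus1: collection of dense (k-1)-dim subspaces, each a sorted
    tuple of dimension indices. Returns the surviving k-dim candidates."""
    dense_set = set(dense_kminus1)
    candidates = set()
    for a, b in combinations(dense_kminus1, 2):
        merged = tuple(sorted(set(a) | set(b)))
        if len(merged) == len(a) + 1:
            # monotonicity check: every (k-1)-dim projection must be dense
            if all(tuple(sorted(set(merged) - {d})) in dense_set for d in merged):
                candidates.add(merged)
    return candidates
```

For example, `candidate_subspaces({(0,), (1,), (2,)})` yields all three 2-dimensional pairs, while a candidate such as (0, 1) would be dropped if either (0,) or (1,) were not dense.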

PROCLUS [13] is a variation of the k-medoid algorithm for subspace clustering. It starts by choosing a random set of k medoids and iterates to improve the quality of the result by exchanging bad medoids for new ones. In each iteration, all data points are assigned to their nearest medoids, and each cluster's dimensions are selected based on the assigned points. When the quality of the result does not change within a certain number of medoid changes, the algorithm stops and reports the generated clusters as the result. ORCLUS [19] is an extended version of PROCLUS that deals with correlations along arbitrarily oriented axes. As in PROCLUS, in each iteration points are assigned to the nearest seeds to form clusters, but the distance is measured in each cluster's own arbitrarily oriented subspace. To find the hidden dimensions of clusters, ORCLUS uses singular value decomposition, a well-known dimension reduction technique. At the end of each iteration, the number of seeds and their dimension sizes are reduced according to the given factors α and β (note that the initial number of seeds is much larger than the number of clusters k). The algorithm stops when the number of remaining seeds is reduced to k. In general, both PROCLUS and ORCLUS generate highly disjoint clusters compared to CLIQUE.
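
To make the assignment step concrete, here is a minimal sketch (our illustration under simplifying assumptions, not the published PROCLUS code): each point joins the medoid that is nearest under that medoid's own selected dimensions, using an average per-dimension Manhattan distance.

```python
import numpy as np

def assign_points(data, medoids, dims_per_medoid):
    """One assignment step of a PROCLUS-like iteration (illustrative).

    data: (n, D) array of points; medoids: (k, D) array; dims_per_medoid:
    list of k index lists, the dimensions selected for each medoid."""
    labels = np.empty(len(data), dtype=int)
    for i, p in enumerate(data):
        best_j, best_d = 0, np.inf
        for j, (m, dims) in enumerate(zip(medoids, dims_per_medoid)):
            # average Manhattan distance over this medoid's own dimensions
            d = np.abs(p[dims] - m[dims]).mean()
            if d < best_d:
                best_j, best_d = j, d
        labels[i] = best_j
    return labels
```

Averaging rather than summing over the selected dimensions keeps distances comparable across medoids whose subspaces have different sizes.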

CLTree [20] is another interesting method which does not belong to either of the above two classes. It adapts the decision tree, originally a classification method, into a subspace clustering algorithm. Because the decision tree algorithm requires every point to carry a binary class label, CLTree regards all points in the given region as labeled Y and then uniformly scatters virtual points labeled N into that region. For the Y-class and N-class points in the given region, the best cut distinguishing the two classes is selected. The two partitioned regions then become the child nodes of the original region, and this process is repeated until the decision tree is completely constructed. Since tree construction usually produces too many regions, pruning and merging steps are subsequently performed to form appropriately sized clusters.
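
As a deliberately simplified illustration of the virtual-point idea, the one-dimensional sketch below scatters uniform N-class points into a region and picks the cut that best separates them from the real Y-class points. The separation score is our own simple stand-in, not CLTree's actual information-gain criterion, and `n_cuts` is a hypothetical discretization parameter.

```python
import numpy as np

def best_cut_1d(y_vals, lo, hi, n_virtual, n_cuts=32, seed=0):
    """1-d sketch of CLTree's virtual-point idea (illustrative only):
    real points are class Y, n_virtual uniformly scattered points are
    class N; return the cut where the class distributions differ most."""
    rng = np.random.default_rng(seed)
    n_vals = rng.uniform(lo, hi, size=n_virtual)         # virtual N points
    y_vals = np.asarray(y_vals, dtype=float)
    best_cut, best_score = None, -np.inf
    for cut in np.linspace(lo, hi, n_cuts + 2)[1:-1]:    # interior cut points
        y_left = np.mean(y_vals <= cut)                  # Y mass left of cut
        n_left = np.mean(n_vals <= cut)                  # N mass left of cut
        score = abs(y_left - n_left)                     # crude separation
        if score > best_score:
            best_cut, best_score = cut, score
    return best_cut
```

Where the real points cluster tightly, their mass piles up on one side of a cut while the uniform virtual mass does not, so the best cut tends to fall at a cluster boundary.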

Although these previous subspace clustering methods have been successful in some respects, they have problems with scalability to the dimensionality [10], [14], [18] and to the dataset size [13], [19]. Moreover, none of them has reported robustness results with respect to the noise ratio or the volume of clusters. Such robustness is very important because noise (i.e. outliers) is likely to increase with the dimensionality of the dataset, and clustering in high dimensions is precisely what subspace clustering was proposed for.

We present an algorithm, named FINDIT (a Fast and INtelligent subspace clustering algorithm using DImension voTing), which is highly accurate under various conditions and remains very fast as the dataset size and dimensionality increase. We also suggest a new distance measure devised for subspace clustering which could be usefully adopted by other high dimensional clustering methods.

The remainder of the paper is organized as follows. Section 2 presents our motivation and an overview of the algorithm. Section 3 explains the algorithm in detail. Section 4 describes the datasets used in the experiments and reports the performance results. In Section 5, we discuss various properties of FINDIT. Section 6 concludes the paper.

Section snippets

Dimension-oriented distance

Dimension-oriented distance (dod) is our distance measure, which utilizes dimensional difference information and value difference information together, while conventional distance measures use only value difference information. We measure the similarity between two points by counting the number of dimensions in which the two points' value difference is smaller than the given ϵ, because we regard the points as 'near enough' on such a dimension. That is, if the Manhattan distance between two …
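
A minimal sketch of this measure as we read it from the snippet (the full dod definition also folds in value-difference information, which this sketch omits):

```python
import numpy as np

def dod(p, q, eps):
    """Dimension-oriented distance, sketched from the description above.

    A dimension is 'near enough' if the two points' value difference
    there is smaller than eps; dod counts the remaining dissimilar
    dimensions, so a smaller dod means a more similar pair of points."""
    diff = np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))
    near = diff < eps        # dimensions where the points agree within eps
    return int(near.size - near.sum())
```

For example, with eps = 5, the points (1, 2, 90) and (3, 4, 10) have dod 1: they are near in the first two dimensions and far apart only in the third.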

Sampling phase

In the sampling phase, two different samples S and M are drawn from the dataset based on the dataset size N and the user parameter Cminsize. In order for the subsequent process to work properly, any original cluster larger than Cminsize should have more than a certain number of points in S and at least one point in M. For S and M to satisfy this property, we should answer the question: 'How many points should be selected for S and M, respectively?' As a solution to this question, …
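
The snippet breaks off before giving the answer. A standard way to size such samples, used by several sampling-based clustering algorithms and offered here only as a plausible sketch rather than FINDIT's exact formula, is a Chernoff-bound argument: choose the smallest s such that a cluster of Cminsize points contributes at least k_min points to a uniform sample of size s with probability at least 1 − δ.

```python
import math

def sample_size(N, c_min, k_min, delta=0.01):
    """Chernoff-bound sample size (a standard argument, not necessarily
    FINDIT's exact formula): smallest s such that a uniform sample of s
    points from a dataset of size N contains at least k_min points of a
    cluster of at least c_min points, with probability >= 1 - delta."""
    r = N / c_min                   # one cluster point per r dataset points
    ln_d = math.log(1.0 / delta)
    s = k_min * r + r * ln_d + r * math.sqrt(ln_d ** 2 + 2.0 * k_min * ln_d)
    return math.ceil(s)

# Requiring at least one point per cluster, as for the seed sample M,
# corresponds to k_min = 1; the larger sample S would use a larger k_min.
```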

Experiments

We conducted a series of experiments designed to measure the performance of FINDIT in terms of accuracy, robustness, and scalability. For this purpose we generated various synthetic datasets based on the method described in Ref. [13], varying the parameters over a range of values. All experiments were run on a 500 MHz Pentium-3 Linux machine with 1 GB of memory and a 15 GB SCSI disk drive.

Discussion

Epsilon and the soundness criteria. Selecting the best ϵ is very important in our method since this running parameter plays a key role in shaping clusters. To investigate the properties of ϵ, we conducted an experiment on two datasets, S1D20AD7PO0.4 and S3D20AD7PO0.4. For these two datasets, we obtained 25 different medoid cluster sets by varying ϵ from 1 to 25 (note that all dimensions have the normalized value range of [1,100]), and performed the data assigning phase against the 25 different medoid …
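
The experiment above amounts to a simple parameter sweep. Below is a sketch of it (our illustration; `build_clusters` and `soundness` are hypothetical callables standing in for FINDIT's medoid clustering phase and its soundness criterion, whose definitions the snippet does not include):

```python
def select_best_epsilon(data, build_clusters, soundness, epsilons=range(1, 26)):
    """Sweep candidate epsilon values, cluster once per value, and keep
    the epsilon whose medoid cluster set scores highest on soundness.
    build_clusters and soundness are placeholders for FINDIT's phases."""
    scored = [(soundness(build_clusters(data, eps)), eps) for eps in epsilons]
    best_score, best_eps = max(scored)
    return best_eps, best_score
```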

Conclusion

In this paper, we have suggested a new subspace clustering algorithm named FINDIT. We have experimentally shown that the proposed algorithm significantly improves the quality of clustering, the robustness to increasing noise and cluster diameter, and the time scalability to the dataset size and the dimensionality. It accurately generates disjoint clusters together with their subspace information, and this high accuracy does not degrade at all even when 50% of the data points are outliers. The accuracy and …

References (22)

  • J.S. Vitter, Random sampling with a reservoir (1985)
  • A.K. Jain et al., Algorithms for Clustering Data (1988)
  • L. Kaufman et al., Finding Groups in Data: An Introduction to Cluster Analysis (1990)
  • D.R. Cutting et al., Scatter/gather: a cluster-based approach to browsing large document collections (1992)
  • R. Ng et al., Efficient and effective clustering methods for spatial data mining (1994)
  • M. Ester et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Portland, OR (1996)
  • T. Zhang et al., BIRCH: an efficient data clustering method for large databases, in: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Que., Canada (1996)
  • W. Wang et al., STING: a statistical information grid approach to spatial data mining (1997)
  • G. Sheikholeslami et al., A multi-resolution clustering approach for very large spatial databases (1998)
  • R. Agrawal et al., Automatic subspace clustering of high dimensional data for data mining applications, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, WA (1998)
  • N. Joze-Khajavi et al., Two-phase clustering of large datasets (1998)