ABSTRACT
While much work has been done on finding linear correlation among subsets of features in high-dimensional data, the detection of nonlinear correlation has been left largely untouched. In this paper, we present an algorithm for finding and visualizing nonlinear correlation clusters in subspaces of high-dimensional databases. Unlike the detection of linear correlation, in which each cluster has a unique orientation, finding nonlinear correlation clusters requires merging clusters of possibly very different orientations. Combined with the fact that spatial proximity must be judged on a subset of features that is not known in advance, deciding which clusters to merge during the clustering process becomes a challenge. To address this problem, we propose a novel concept called co-sharing level, which captures both spatial proximity and cluster orientation when judging the similarity between clusters. Based on this concept, we develop an algorithm that not only detects nonlinear correlation clusters but also provides a way to visualize them. Experiments on both synthetic and real-life datasets show the effectiveness of our method.
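The abstract does not define co-sharing level, so the following is only a rough illustration of the idea, not the paper's algorithm. It assumes a soft-assignment reading: each point belongs to every cluster with some probability, and two clusters co-share a point in proportion to the product of those probabilities. The function names, the isotropic Gaussian membership model, the `sigma` value, and the toy parabola data are all hypothetical choices made for this sketch.

```python
import math


def memberships(point, centers, sigma=0.5):
    """Soft membership of one point in each cluster.

    Assumption: clusters are isotropic Gaussians with a shared width sigma.
    """
    weights = [
        math.exp(-sum((p - c) ** 2 for p, c in zip(point, center)) / (2 * sigma ** 2))
        for center in centers
    ]
    total = sum(weights)
    return [w / total for w in weights]


def cosharing(points, centers, sigma=0.5):
    """Pairwise co-sharing matrix under the assumed definition:
    entry (i, j) sums, over all points, the product of the point's
    membership probabilities in clusters i and j.
    """
    k = len(centers)
    matrix = [[0.0] * k for _ in range(k)]
    for point in points:
        probs = memberships(point, centers, sigma)
        for i in range(k):
            for j in range(k):
                matrix[i][j] += probs[i] * probs[j]
    return matrix


# Toy nonlinear structure: points sampled along the parabola y = x^2.
points = [(x / 10, (x / 10) ** 2) for x in range(-20, 21)]

# Four hypothetical cluster centers placed along the same curve.
centers = [(-1.5, 2.25), (-0.5, 0.25), (0.5, 0.25), (1.5, 2.25)]

matrix = cosharing(points, centers)

for row in matrix:
    print(" ".join(f"{value:8.4f}" for value in row))
```

In this toy setup, clusters that are neighbors along the curve share points and therefore score higher than the two clusters at opposite ends of the parabola, which is the kind of signal a merging step could use when the clusters differ in orientation.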
- CURLER: finding and visualizing nonlinear correlation clusters