Abstract
High-dimensional data sets generally exhibit low density, since the number of possible cells far exceeds the number of cells actually present in the set. This characteristic has prompted researchers to automate the search for subspaces where the density is higher. In this paper we present an algorithm that takes advantage of categorical, unordered dimensions to increase the density of subspaces in the data set. It does this by shuffling rows in those dimensions so that the final ordering increases the density of regions in hyperspace. We argue for using this shuffling technique as a preprocessing step for other techniques that compress the hyperspace by means of statistical models, since denser regions usually result in better-fitting models. The experimental results support this argument. We also show how to integrate this algorithm with two grid clustering procedures in order to find these dense regions. The experimental results on both synthetic and real data sets show that row shuffling can drastically increase the density of the subspaces, leading to better clusters.
This work has been supported by NSF grant IIS-9732113
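The effect described in the abstract can be illustrated with a toy example. The sketch below is not the algorithm from the paper, whose details are not given here; it is a minimal illustration that assumes a two-dimensional 0/1 occupancy grid, a hypothetical reordering heuristic (`shuffle_rows`, which simply places categories with identical occupancy patterns next to each other), and a hypothetical density measure (`best_window_density`, the highest fraction of occupied cells in any fixed-size window). Because the row dimension is categorical and unordered, any permutation of its values is equally valid, so the ordering can be chosen to make regions denser.

```python
from itertools import product


def best_window_density(matrix, height, width):
    """Return the highest fraction of occupied cells over all
    height x width windows of consecutive rows and columns."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    best = 0.0
    for top, left in product(range(n_rows - height + 1),
                             range(n_cols - width + 1)):
        filled = sum(matrix[r][c]
                     for r in range(top, top + height)
                     for c in range(left, left + width))
        best = max(best, filled / (height * width))
    return best


def shuffle_rows(occupancy):
    """Toy heuristic (an assumption, not the paper's method): order the
    categories so that identical occupancy patterns become adjacent."""
    return sorted(occupancy, key=lambda label: occupancy[label], reverse=True)


# Rows are values of an unordered categorical dimension; columns are cells
# of a second dimension. 1 marks an occupied cell, 0 an empty one.
occupancy = {
    "A": [1, 1, 1, 0, 0, 0],
    "B": [0, 0, 0, 1, 1, 1],
    "C": [1, 1, 1, 0, 0, 0],
    "D": [0, 0, 0, 1, 1, 1],
    "E": [1, 1, 1, 0, 0, 0],
    "F": [0, 0, 0, 1, 1, 1],
}

original_order = list(occupancy)
shuffled_order = shuffle_rows(occupancy)

before = best_window_density([occupancy[k] for k in original_order], 3, 3)
after = best_window_density([occupancy[k] for k in shuffled_order], 3, 3)

print("original order:", original_order, "best 3x3 density:", round(before, 3))
print("shuffled order:", shuffled_order, "best 3x3 density:", round(after, 3))
```

In this toy grid the total number of occupied cells is unchanged by the reordering; only their arrangement differs, which is what allows a fixed-size region to become fully occupied after the shuffle.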
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Barbará, D., Wu, X. (2001). Finding Dense Clusters in Hyperspace: An Approach Based on Row Shuffling. In: Wang, X.S., Yu, G., Lu, H. (eds) Advances in Web-Age Information Management. WAIM 2001. Lecture Notes in Computer Science, vol 2118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47714-4_28
DOI: https://doi.org/10.1007/3-540-47714-4_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42298-3
Online ISBN: 978-3-540-47714-3
eBook Packages: Springer Book Archive