Finding Dense Clusters in Hyperspace: An Approach Based on Row Shuffling

Barbará, Daniel; Wu, Xintao

doi:10.1007/3-540-47714-4_28

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2118))

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

High-dimensional data sets generally exhibit low density, since the number of possible cells exceeds the number of cells actually present in the set. This characteristic has prompted researchers to automate the search for subspaces where the density is higher. In this paper we present an algorithm that takes advantage of categorical, unordered dimensions to increase the density of subspaces in the data set. It does so by shuffling rows in those dimensions, so that the final ordering yields denser regions in hyperspace. We argue for using this shuffling technique as a preprocessing step for other techniques that compress the hyperspace by means of statistical models, since denser regions usually result in better-fitting models. The experimental results support this argument. We also show how to integrate this algorithm with two grid clustering procedures in order to find these dense regions. The experimental results on both synthetic and real data sets show that row shuffling can drastically increase the density of the subspaces, leading to better clusters.
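
The abstract does not spell out the shuffling procedure, so the following is only a minimal sketch of the underlying idea, not the authors' algorithm. It assumes a two-dimensional 0/1 occupancy grid over two unordered categorical dimensions and uses a simple stand-in heuristic (sorting rows, then columns, by their occupancy pattern). The names `window_density`, `best_window_density`, `shuffle_rows`, and `shuffle_columns` are hypothetical and chosen for illustration.

```python
def window_density(grid, top, left, height, width):
    """Fraction of populated cells in one rectangular window of the grid."""
    filled = sum(
        grid[r][c]
        for r in range(top, top + height)
        for c in range(left, left + width)
    )
    return filled / (height * width)


def best_window_density(grid, height, width):
    """Highest density over all height x width windows of contiguous cells."""
    rows, cols = len(grid), len(grid[0])
    return max(
        window_density(grid, t, l, height, width)
        for t in range(rows - height + 1)
        for l in range(cols - width + 1)
    )


def shuffle_rows(grid):
    # Assumed heuristic: because the dimension is unordered, rows may be
    # permuted freely. Sorting by occupancy pattern puts identical or
    # similar rows next to each other.
    return sorted(grid, reverse=True)


def shuffle_columns(grid):
    # Apply the same heuristic to the other dimension via a transpose.
    transposed = [list(col) for col in zip(*grid)]
    return [list(row) for row in zip(*shuffle_rows(transposed))]


# Toy occupancy grid: 1 marks a populated cell, 0 an empty one.
grid = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]

shuffled = shuffle_columns(shuffle_rows(grid))
print(best_window_density(grid, 2, 2))
print(best_window_density(shuffled, 2, 2))
```

On this toy grid the densest 2 x 2 window goes from half full before reordering to completely full afterwards, while the total number of populated cells is unchanged. A permutation of this kind only makes sense for dimensions with no intrinsic order, which is the condition the abstract states.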

This work has been supported by NSF grant IIS-9732113.

Author information

Authors and Affiliations

ISE Dept., George Mason University, MSN 4A4, Fairfax, VA, 22030, USA
Daniel Barbará & Xintao Wu

Editor information

Editors and Affiliations

Department of Information and Software Engineering, George Mason University, Fairfax, VA, 22030-4444, USA
X. Sean Wang
Department of Computer Science and Engineering, Northeastern University, Shenyang, 110004, China
Ge Yu
Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
Hongjun Lu

About this paper

Cite this paper

Barbará, D., Wu, X. (2001). Finding Dense Clusters in Hyperspace: An Approach Based on Row Shuffling. In: Wang, X.S., Yu, G., Lu, H. (eds) Advances in Web-Age Information Management. WAIM 2001. Lecture Notes in Computer Science, vol 2118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47714-4_28

DOI: https://doi.org/10.1007/3-540-47714-4_28
Published: 28 June 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42298-3
Online ISBN: 978-3-540-47714-3
eBook Packages: Springer Book Archive