Subspace Clustering of High Dimensional Spatial Data with Noises

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2004)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3056)

Abstract

Clustering large, high-dimensional spatial data sets that contain noise is a difficult challenge in data mining. In this paper, we present a new subspace clustering method, called SCI (Subspace Clustering based on Information), to solve this problem. SCI combines Shannon information with grid-based and density-based clustering techniques. Designing a clustering algorithm is equivalent to constructing an equivalence relation among data points. We therefore propose an equivalence relation, named density-connected, to identify the main bodies of clusters. For noise detection and cluster boundary discovery, we also use the grid approach to devise a new cohesive mechanism that merges border data points into clusters and filters out the noise. However, the curse of dimensionality is a well-known and serious problem when the grid approach is applied to high-dimensional data sets, because the number of grid cells grows exponentially with the number of dimensions. To strike a compromise between randomness and structure, we propose an automatic method for attribute selection based on Shannon information. Because it requires only one scan of the data, SCI is very efficient, with a run time that is linear in the size of the input data set. Our experimental results show that SCI is effective at discovering clusters of arbitrary shape.
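
The abstract does not give the details of SCI, so the following is only a minimal sketch of the general ideas it names, not the authors' algorithm. It assumes a fixed cell width, a simple entropy threshold for attribute selection, and face- or corner-adjacency between dense grid cells as the connectivity relation. All function names and parameters (`dimension_entropy`, `select_dimensions`, `grid_clusters`, `width`, `max_entropy`, `min_pts`) are hypothetical, and the border-merging mechanism described in the abstract is omitted: points in sparse cells are simply labelled as noise.

```python
import math
from collections import Counter, deque
from itertools import product


def dimension_entropy(points, dim, width):
    """Shannon entropy (in bits) of the 1-D grid histogram along one dimension."""
    counts = Counter(math.floor(p[dim] / width) for p in points)
    n = len(points)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def select_dimensions(points, width, max_entropy):
    """Assumed selection rule: keep dimensions whose entropy is at most max_entropy."""
    n_dims = len(points[0])
    return [d for d in range(n_dims)
            if dimension_entropy(points, d, width) <= max_entropy]


def cell_of(point, dims, width):
    """Grid cell of a point, restricted to the selected dimensions."""
    return tuple(math.floor(point[d] / width) for d in dims)


def grid_clusters(points, dims, width, min_pts):
    """Label points by connected components of dense cells; -1 marks noise.

    Two dense cells end up in the same cluster when a chain of adjacent
    dense cells joins them, which is an equivalence relation on dense cells.
    """
    cells = Counter(cell_of(p, dims, width) for p in points)
    dense = {c for c, k in cells.items() if k >= min_pts}
    offsets = [o for o in product((-1, 0, 1), repeat=len(dims)) if any(o)]
    label = {}
    next_id = 0
    for start in sorted(dense):
        if start in label:
            continue
        label[start] = next_id
        queue = deque([start])
        while queue:  # breadth-first search over adjacent dense cells
            cur = queue.popleft()
            for o in offsets:
                nb = tuple(a + b for a, b in zip(cur, o))
                if nb in dense and nb not in label:
                    label[nb] = next_id
                    queue.append(nb)
        next_id += 1
    return [label.get(cell_of(p, dims, width), -1) for p in points]


# Toy data (made up for illustration): two groups in dimensions 0 and 1,
# a third dimension that is spread evenly, and one isolated point.
points = [
    (0.1, 0.2, 0.5), (0.3, 0.4, 1.5), (0.5, 0.6, 2.5), (0.7, 0.8, 3.5),
    (5.1, 5.2, 4.5), (5.3, 5.4, 5.5), (5.5, 5.6, 6.5), (5.7, 5.8, 7.5),
    (9.5, 2.5, 8.5),
]
dims = select_dimensions(points, width=1.0, max_entropy=2.0)
labels = grid_clusters(points, dims, width=1.0, min_pts=2)
```

On this toy input the evenly spread third dimension has the highest histogram entropy and is dropped, the two groups form separate clusters in the remaining subspace, and the isolated point is labelled noise. Clustering only in the selected dimensions keeps the number of occupied grid cells small, which is the motivation the abstract gives for attribute selection.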

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hsu, CM., Chen, MS. (2004). Subspace Clustering of High Dimensional Spatial Data with Noises. In: Dai, H., Srikant, R., Zhang, C. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2004. Lecture Notes in Computer Science, vol 3056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24775-3_6

  • DOI: https://doi.org/10.1007/978-3-540-24775-3_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-22064-0

  • Online ISBN: 978-3-540-24775-3

  • eBook Packages: Springer Book Archive
