
CLINCH: Clustering Incomplete High-Dimensional Data for Data Mining Application

  • Conference paper
Web Technologies Research and Development - APWeb 2005 (APWeb 2005)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 3399))


Abstract

Clustering is a common data mining technique for discovering hidden patterns in massive datasets. With the development of privacy-maintaining data mining applications, clustering incomplete high-dimensional data has become increasingly useful. Motivated by this need, we develop a novel algorithm, CLINCH, which can produce fine clusters in incomplete high-dimensional data spaces. To handle missing attributes, CLINCH employs a prediction method that can be more precise than traditional techniques. We also introduce an efficient approach in which dimensions are processed one by one to attack the “curse of dimensionality”. Experiments show that our algorithm not only outperforms many existing high-dimensional clustering algorithms in scalability and efficiency, but also produces precise results.

This paper was supported by the Key Program of National Natural Science Foundation of China (No. 69933010 and 60303008) and China National 863 High-Tech Projects (No. 2002AA4Z3430 and 2002AA231041).

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cheng, Z. et al. (2005). CLINCH: Clustering Incomplete High-Dimensional Data for Data Mining Application. In: Zhang, Y., Tanaka, K., Yu, J.X., Wang, S., Li, M. (eds) Web Technologies Research and Development - APWeb 2005. APWeb 2005. Lecture Notes in Computer Science, vol 3399. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31849-1_10

  • DOI: https://doi.org/10.1007/978-3-540-31849-1_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25207-8

  • Online ISBN: 978-3-540-31849-1

  • eBook Packages: Computer Science, Computer Science (R0)
