Abstract
Clustering is a common data mining technique for discovering hidden patterns in massive datasets. With the development of privacy-preserving data mining applications, clustering incomplete high-dimensional data has become increasingly useful. Motivated by this need, we develop a novel algorithm, CLINCH, which produces fine clusters in incomplete high-dimensional data spaces. To handle missing attributes, CLINCH employs a prediction method that can be more precise than traditional techniques. We also introduce an efficient scheme that processes dimensions one by one to attack the “curse of dimensionality”. Experiments show that our algorithm not only outperforms many existing high-dimensional clustering algorithms in scalability and efficiency, but also produces precise results.
This paper was supported by the Key Program of the National Natural Science Foundation of China (Nos. 69933010 and 60303008) and the China National 863 High-Tech Projects (Nos. 2002AA4Z3430 and 2002AA231041).
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cheng, Z. et al. (2005). CLINCH: Clustering Incomplete High-Dimensional Data for Data Mining Application. In: Zhang, Y., Tanaka, K., Yu, J.X., Wang, S., Li, M. (eds) Web Technologies Research and Development - APWeb 2005. APWeb 2005. Lecture Notes in Computer Science, vol 3399. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31849-1_10
DOI: https://doi.org/10.1007/978-3-540-31849-1_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25207-8
Online ISBN: 978-3-540-31849-1
eBook Packages: Computer Science (R0)