CLINCH: Clustering Incomplete High-Dimensional Data for Data Mining Application

Cheng, Zunping; Zhou, Ding; Wang, Chen; Guo, Jiankui; Wang, Wei; Ding, Baokang; Shi, Baile

doi:10.1007/978-3-540-31849-1_10

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3399)

Included in the following conference series:

Asia-Pacific Web Conference

Abstract

Clustering is a common data mining technique for discovering hidden patterns in massive datasets. With the development of privacy-preserving data mining applications, clustering incomplete high-dimensional data has become increasingly useful. Motivated by this need, we develop a novel algorithm, CLINCH, which produces fine clusters in incomplete high-dimensional data spaces. To handle missing attributes, CLINCH employs a prediction method that can be more precise than traditional techniques. We also introduce an efficient scheme in which dimensions are processed one by one to attack the “curse of dimensionality”. Experiments show that our algorithm not only outperforms many existing high-dimensional clustering algorithms in scalability and efficiency, but also produces precise results.
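
The abstract does not spell out how the prediction step or the per-dimension processing work, so the following is only a loose illustration of the general idea, not the CLINCH algorithm itself. It is a minimal sketch under stated assumptions: each dimension is cut into equal-width bins, dimensions are handled one at a time, and a missing value is assigned the bin most common among rows that fell into the same bins in the dimensions already processed. The names `bin_index` and `cluster_by_dimension`, the `n_bins` parameter, and the voting rule are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]  # None marks a missing attribute


def bin_index(value: float, lo: float, hi: float, n_bins: int) -> int:
    """Map a value in [lo, hi] to an equal-width bin index in [0, n_bins)."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(idx, n_bins - 1)  # the maximum value falls into the last bin


def cluster_by_dimension(rows: List[Row], n_bins: int = 4) -> Dict[Tuple[int, ...], List[int]]:
    """Group row indices by their tuple of bin indices, one dimension at a time.

    Illustrative assumption (not taken from the paper): a missing value in
    dimension d gets the bin that is most common among rows sharing the same
    bins in dimensions 0..d-1, falling back to the most common bin overall.
    Every dimension is assumed to contain at least one observed value.
    """
    n_dims = len(rows[0])
    keys: List[Tuple[int, ...]] = [() for _ in rows]
    for d in range(n_dims):
        observed = [r[d] for r in rows if r[d] is not None]
        lo, hi = min(observed), max(observed)
        votes_by_prefix: Dict[Tuple[int, ...], Counter] = defaultdict(Counter)
        overall: Counter = Counter()
        bins: List[Optional[int]] = [None] * len(rows)
        # First pass: bin the observed values and collect votes.
        for i, r in enumerate(rows):
            if r[d] is not None:
                b = bin_index(r[d], lo, hi, n_bins)
                bins[i] = b
                votes_by_prefix[keys[i]][b] += 1
                overall[b] += 1
        # Second pass: predict a bin for each missing value.
        for i in range(len(rows)):
            if bins[i] is None:
                votes = votes_by_prefix.get(keys[i]) or overall
                bins[i] = votes.most_common(1)[0][0]
        keys = [k + (b,) for k, b in zip(keys, bins)]
    clusters: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for i, k in enumerate(keys):
        clusters[k].append(i)
    return dict(clusters)


rows = [
    (0.10, 0.20),
    (0.15, None),   # second attribute missing
    (0.90, 0.95),
    (None, 0.90),   # first attribute missing
    (0.85, 0.90),
    (0.95, 0.85),
]
result = cluster_by_dimension(rows, n_bins=2)
```

With two bins per dimension, rows 0 and 1 end up in one group and rows 2 to 5 in another. Because only one dimension is examined at a time, the work per pass stays linear in the number of rows, which is the intuition behind handling dimensions one by one; the actual density criteria and prediction method of CLINCH are described in the full paper.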

This paper was supported by the Key Program of the National Natural Science Foundation of China (Nos. 69933010 and 60303008) and the China National 863 High-Tech Projects (Nos. 2002AA4Z3430 and 2002AA231041).

Author information

Authors and Affiliations

Fudan University, China
Zunping Cheng, Chen Wang, Jiankui Guo, Wei Wang, Baokang Ding & Baile Shi
Pennsylvania State University, USA
Ding Zhou

Editor information

Editors and Affiliations

Victoria University, Australia
Yanchun Zhang
University of Kyoto, Japan
Katsumi Tanaka
Chinese University of Hong Kong, Hong Kong, China
Jeffrey Xu Yu
Key Laboratory of Data Engineering and Knowledge Engineering, Renmin University of China, MOE, 100872, Beijing, P.R. China
Shan Wang
Department of Computer Science and Engineering, Shanghai Jiao Tong University, 80 Dongchuan Road, 200240, Shanghai, China
Minglu Li

Cite this paper

Cheng, Z. et al. (2005). CLINCH: Clustering Incomplete High-Dimensional Data for Data Mining Application. In: Zhang, Y., Tanaka, K., Yu, J.X., Wang, S., Li, M. (eds) Web Technologies Research and Development - APWeb 2005. APWeb 2005. Lecture Notes in Computer Science, vol 3399. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31849-1_10

DOI: https://doi.org/10.1007/978-3-540-31849-1_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25207-8
Online ISBN: 978-3-540-31849-1
eBook Packages: Computer Science, Computer Science (R0)