Abstract
Projected clustering partitions a data set into several disjoint clusters, plus outliers, so that each cluster exists in a subspace. Subspace clustering enumerates clusters of objects in all subspaces of a data set, and it tends to produce many overlapping clusters. Such algorithms have been extensively studied for numerical data, but only a few have been proposed for categorical data. Typical drawbacks of existing projected and subspace clustering algorithms for numerical or categorical data are that they rely on parameters whose appropriate values are difficult to set, or that they are unable to identify projected clusters with few relevant attributes. We present P3C, a robust algorithm for projected clustering that can effectively discover projected clusters in the data while minimizing the number of required parameters. P3C does not need the number of projected clusters as input, and can discover, under very general conditions, the true number of projected clusters. P3C is effective in detecting very low-dimensional projected clusters embedded in high-dimensional spaces. P3C positions itself between projected and subspace clustering in that it can compute both disjoint and overlapping clusters. P3C is the first projected clustering algorithm for both numerical and categorical data.
Cite this article
Moise, G., Sander, J. & Ester, M. Robust projected clustering. Knowl Inf Syst 14, 273–298 (2008). https://doi.org/10.1007/s10115-007-0090-6
DOI: https://doi.org/10.1007/s10115-007-0090-6