Robust projected clustering

Moise, Gabriela; Sander, Jörg; Ester, Martin

doi:10.1007/s10115-007-0090-6

Regular Paper
Published: 21 July 2007

Volume 14, pages 273–298, (2008)
Knowledge and Information Systems

Gabriela Moise¹,
Jörg Sander¹ &
Martin Ester²


Abstract

Projected clustering partitions a data set into several disjoint clusters, plus outliers, so that each cluster exists in a subspace. Subspace clustering enumerates clusters of objects in all subspaces of a data set, and it tends to produce many overlapping clusters. Such algorithms have been extensively studied for numerical data, but only a few have been proposed for categorical data. Typical drawbacks of existing projected and subspace clustering algorithms for numerical or categorical data are that they rely on parameters whose values are difficult to set appropriately, or that they are unable to identify projected clusters with few relevant attributes. We present P3C, a robust algorithm for projected clustering that can effectively discover projected clusters in the data while minimizing the number of required parameters. P3C does not need the number of projected clusters as input, and can discover, under very general conditions, the true number of projected clusters. P3C is effective in detecting very low-dimensional projected clusters embedded in high-dimensional spaces. P3C positions itself between projected and subspace clustering in that it can compute either disjoint or overlapping clusters. P3C is the first projected clustering algorithm for both numerical and categorical data.
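
To make the notion of a projected cluster concrete, the following minimal sketch builds a synthetic cluster that is compact on only two of ten attributes and then recovers those attributes by their low spread. It is an illustration of the definition only, not the P3C algorithm; the function names, the Gaussian/uniform data model, and the 0.1 spread threshold are all assumptions made for this example.

```python
import random
import statistics


def make_projected_cluster(n_points, n_dims, relevant,
                           center=0.5, spread=0.02, seed=0):
    """Generate points that are tightly grouped only on `relevant` attributes.

    Hypothetical data model: irrelevant attributes are uniform on [0, 1),
    relevant attributes are Gaussian around `center`.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        for attr in relevant:
            point[attr] = rng.gauss(center, spread)
        points.append(point)
    return points


def attribute_spreads(points):
    """Return the population standard deviation of each attribute."""
    return [statistics.pstdev(column) for column in zip(*points)]


def relevant_attributes(points, threshold=0.1):
    """Return indices of attributes whose spread is below `threshold`.

    The threshold is an illustrative choice, not a value from the paper.
    """
    spreads = attribute_spreads(points)
    return [attr for attr, s in enumerate(spreads) if s < threshold]


# A cluster that exists only in the subspace spanned by attributes 2 and 7.
cluster = make_projected_cluster(60, 10, relevant=(2, 7))
print(relevant_attributes(cluster))
```

With these settings the spread on attributes 2 and 7 is close to 0.02, while the uniform attributes have a spread near 0.29, so only the two relevant attributes fall below the threshold. A fixed threshold like this is exactly the kind of hard-to-set parameter the abstract criticizes; it is used here only to keep the illustration short.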

Author information

Authors and Affiliations

Department of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada
Gabriela Moise & Jörg Sander
School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
Martin Ester

Corresponding author

Correspondence to Gabriela Moise.

About this article

Cite this article

Moise, G., Sander, J. & Ester, M. Robust projected clustering. Knowl Inf Syst 14, 273–298 (2008). https://doi.org/10.1007/s10115-007-0090-6

Received: 18 December 2006
Revised: 23 March 2007
Accepted: 23 April 2007
Published: 21 July 2007
Issue Date: March 2008
DOI: https://doi.org/10.1007/s10115-007-0090-6
