Robust projected clustering

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Projected clustering partitions a data set into several disjoint clusters, plus outliers, so that each cluster exists in a subspace. Subspace clustering enumerates clusters of objects in all subspaces of a data set, and it tends to produce many overlapping clusters. Such algorithms have been extensively studied for numerical data, but only a few have been proposed for categorical data. Typical drawbacks of existing projected and subspace clustering algorithms for numerical or categorical data are that they rely on parameters whose appropriate values are difficult to set, or that they are unable to identify projected clusters with few relevant attributes. We present P3C, a robust algorithm for projected clustering that can effectively discover projected clusters in the data while minimizing the number of required parameters. P3C does not need the number of projected clusters as input, and it can discover, under very general conditions, the true number of projected clusters. P3C is effective in detecting very low-dimensional projected clusters embedded in high-dimensional spaces. P3C positions itself between projected and subspace clustering in that it can compute both disjoint and overlapping clusters. P3C is the first projected clustering algorithm for both numerical and categorical data.
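
To make the notion of a projected cluster concrete, the sketch below (a Python illustration under our own assumptions, not the P3C algorithm or the authors' code) embeds one cluster that exists only in a two-attribute subspace of a 50-dimensional space and shows the shape of a projected-clustering result: each cluster paired with its relevant attributes, plus a set of outliers. The helper name make_projected_cluster_data and all parameter choices are hypothetical.

```python
# Minimal sketch of what a "projected cluster" is: points that concentrate in a
# few relevant attributes and are uniformly spread in all others. This is an
# illustration only, not the P3C algorithm; names and parameters are made up.
import numpy as np

rng = np.random.default_rng(0)

def make_projected_cluster_data(n_points=200, n_dims=50, relevant=(3, 17)):
    """Embed one low-dimensional projected cluster in high-dimensional noise."""
    # Background: every attribute is uniform on [0, 1] for every point.
    data = rng.uniform(0.0, 1.0, size=(n_points, n_dims))
    # Cluster members: the first 100 points concentrate around 0.5 in the
    # relevant attributes only; their remaining attributes stay uniform.
    members = np.arange(100)
    for d in relevant:
        data[members, d] = rng.normal(loc=0.5, scale=0.01, size=members.size)
    return data, members, set(relevant)

data, members, relevant_attrs = make_projected_cluster_data()

# A projected clustering result pairs each cluster with its relevant attributes;
# points belonging to no cluster are reported as outliers.
projected_clustering = {
    "clusters": [{"members": set(members.tolist()), "attributes": relevant_attrs}],
    "outliers": set(range(100, data.shape[0])),
}

# The cluster is invisible in the full attribute space but obvious in its
# subspace: compare the spread of the member points per attribute.
print("std over relevant attrs:", data[members][:, list(relevant_attrs)].std())
print("std over a noise attr  :", data[members][:, 0].std())
```

In this representation, a subspace clustering of the same data would be allowed to report overlapping member sets across different attribute subsets, whereas a projected clustering keeps the member sets disjoint.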

Author information

Corresponding author

Correspondence to Gabriela Moise.

About this article

Cite this article

Moise, G., Sander, J. & Ester, M. Robust projected clustering. Knowl Inf Syst 14, 273–298 (2008). https://doi.org/10.1007/s10115-007-0090-6

  • DOI: https://doi.org/10.1007/s10115-007-0090-6