Characterizing pattern preserving clustering

  • Regular Paper
Knowledge and Information Systems

Abstract

This paper describes a new approach to clustering, pattern preserving clustering, which produces clusters that are easier to interpret and use. The approach is motivated by the following observation: data usually contain strong patterns that may be key to analyzing and describing the data, yet current clustering approaches often split these patterns among different clusters. This is perhaps not surprising, since clustering algorithms have no built-in knowledge of these patterns and often pursue goals that conflict with preserving them, e.g., minimizing the distance of points to their nearest cluster centroids. In this paper, we characterize (1) the benefits of pattern preserving clustering and (2) the most effective way of performing it. To that end, we propose and evaluate two clustering algorithms: HIerarchical Clustering with pAttern Preservation (HICAP) and bisecting K-means Clustering with pAttern Preservation (K-CAP). Experimental results on document data show that HICAP can produce overlapping clusters that preserve useful patterns, but performs worse than bisecting K-means with respect to the clustering evaluation criterion of entropy. By contrast, in terms of entropy, K-CAP can perform substantially better than the bisecting K-means algorithm when data sets contain clusters of widely different sizes, a common situation in the real world. Most importantly, we also illustrate how patterns, if preserved, can aid cluster interpretation.
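
The entropy criterion mentioned above can be illustrated with a minimal sketch. It assumes the size-weighted definition commonly used in document clustering, where each cluster's entropy is computed over the known class labels of its members and the results are averaged by cluster size. The abstract does not spell out the paper's exact formula, so the function name `clustering_entropy` and its arguments are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter, defaultdict


def clustering_entropy(cluster_ids, class_labels):
    """Size-weighted average of per-cluster class entropy, in bits.

    Lower is better: 0.0 means every cluster contains a single class.
    Assumes a hard (non-overlapping) assignment of points to clusters.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")

    # Group the true class labels by the cluster each point was assigned to.
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, class_labels):
        members[cid].append(label)

    n = len(class_labels)
    total = 0.0
    for labels in members.values():
        size = len(labels)
        # Entropy of the class distribution inside this cluster.
        h = -sum((c / size) * math.log2(c / size)
                 for c in Counter(labels).values())
        # Weight each cluster by its share of all points.
        total += (size / n) * h
    return total
```

For example, `clustering_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])` describes two pure clusters, while `clustering_entropy([0, 0, 1, 1], ["a", "b", "a", "b"])` describes two evenly mixed ones. Overlapping clusters, such as those HICAP can produce, would need a different treatment that this sketch does not attempt.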

Author information

Corresponding author

Correspondence to Hui Xiong.

About this article

Cite this article

Xiong, H., Steinbach, M., Ruslim, A. et al. Characterizing pattern preserving clustering. Knowl Inf Syst 19, 311–336 (2009). https://doi.org/10.1007/s10115-008-0148-0

  • DOI: https://doi.org/10.1007/s10115-008-0148-0
