Characterizing pattern preserving clustering

Xiong, Hui; Steinbach, Michael; Ruslim, Arifin; Kumar, Vipin

doi:10.1007/s10115-008-0148-0

Regular Paper
Published: 30 May 2008

Knowledge and Information Systems, Volume 19, pages 311–336 (2009)

Abstract

This paper describes a new approach to clustering, pattern preserving clustering, which produces clusters that are easier to interpret and use. The approach is motivated by the following observation: data usually contain strong patterns that may be key to analyzing and describing the data, yet current clustering approaches often split these patterns among different clusters. This is perhaps not surprising, since clustering algorithms have no built-in knowledge of these patterns and often have goals that conflict with preserving them, e.g., minimizing the distance of points to their nearest cluster centroids. In this paper, we characterize (1) the benefits of pattern preserving clustering and (2) the most effective way of performing it. To that end, we propose and evaluate two clustering algorithms: HIerarchical Clustering with pAttern Preservation (HICAP) and bisecting K-means Clustering with pAttern Preservation (K-CAP). Experimental results on document data show that HICAP can produce overlapping clusters that preserve useful patterns, but performs worse than bisecting K-means with respect to entropy as the clustering evaluation criterion. By contrast, K-CAP can achieve substantially better entropy than bisecting K-means when data sets contain clusters of widely different sizes, a common situation in the real world. Most importantly, we also illustrate how patterns, if preserved, can aid cluster interpretation.
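
The two ideas the abstract relies on, entropy as an evaluation criterion and patterns being split among clusters, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the commonly used size-weighted class entropy (the paper's exact definition is not visible in this preview) and assumes a pattern is represented as a set of object indices. The function names `clustering_entropy` and `split_patterns` are hypothetical.

```python
import math
from collections import Counter


def clustering_entropy(cluster_ids, class_labels):
    """Size-weighted average entropy of class labels inside each cluster.

    Assumption: the standard weighted-entropy definition, in bits.
    Lower values mean purer clusters; 0 means every cluster holds one class.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    n = len(cluster_ids)

    # Group the class labels by the cluster each object was assigned to.
    by_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cluster, []).append(label)

    total = 0.0
    for members in by_cluster.values():
        size = len(members)
        counts = Counter(members)
        # Entropy of the class distribution within this cluster.
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        total += (size / n) * h
    return total


def split_patterns(patterns, cluster_ids):
    """Return the patterns whose member objects land in more than one cluster.

    Assumption: each pattern is a set of object indices into cluster_ids.
    """
    return [p for p in patterns if len({cluster_ids[i] for i in p}) > 1]
```

Under these assumptions, a pattern preserving clustering is one for which `split_patterns` returns an empty list, and `clustering_entropy` gives the quantity on which such a clustering would be compared against a baseline like bisecting K-means.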

Author information

Authors and Affiliations

Department of Management Science and Information Systems, Rutgers, The State University of New Jersey, Newark, NJ, 07102, USA
Hui Xiong
Department of Computer Science and Engineering, University of Minnesota-Twin Cities, Minneapolis, MN, USA
Michael Steinbach, Arifin Ruslim & Vipin Kumar

Corresponding author

Correspondence to Hui Xiong.

About this article

Cite this article

Xiong, H., Steinbach, M., Ruslim, A. et al. Characterizing pattern preserving clustering. Knowl Inf Syst 19, 311–336 (2009). https://doi.org/10.1007/s10115-008-0148-0

Received: 21 June 2007
Revised: 16 March 2008
Accepted: 12 April 2008
Published: 30 May 2008
Issue Date: June 2009
DOI: https://doi.org/10.1007/s10115-008-0148-0
