Abstract
This paper describes a new approach to clustering, pattern preserving clustering, which produces clusters that are easier to interpret and use. The approach is motivated by the following observation: data usually contain strong patterns that may be key to analyzing and describing the data, yet current clustering approaches often split these patterns among different clusters. This is perhaps not surprising, since clustering algorithms have no built-in knowledge of these patterns and often have goals that conflict with preserving them, e.g., minimizing the distance of points to their nearest cluster centroids. In this paper, our focus is to characterize (1) the benefits of pattern preserving clustering and (2) the most effective way of performing pattern preserving clustering. To that end, we propose and evaluate two clustering algorithms, HIerarchical Clustering with pAttern Preservation (HICAP) and bisecting K-means Clustering with pAttern Preservation (K-CAP). Experimental results on document data show that HICAP can produce overlapping clusters that preserve useful patterns, but performs relatively worse than bisecting K-means with respect to the clustering evaluation criterion of entropy. By contrast, in terms of entropy, K-CAP can perform substantially better than the bisecting K-means algorithm when data sets contain clusters of widely different sizes, a common situation in the real world. Most importantly, we also illustrate how patterns, if preserved, can aid cluster interpretation.
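As an illustration of the entropy criterion mentioned above, the following is a minimal sketch of one common formulation: the class-label entropy of each cluster, averaged with weights proportional to cluster size. The function name `cluster_entropy`, the base-2 logarithm, and the absence of any normalization are assumptions made for this example; the exact definition used in the paper's experiments may differ.

```python
import math
from collections import Counter


def cluster_entropy(labels, assignments):
    """Size-weighted average entropy of class labels within each cluster.

    labels:      true class label of each object (hypothetical input format)
    assignments: cluster id assigned to each object, same order as labels
    Lower values mean purer clusters; 0 means every cluster holds one class.
    """
    n = len(labels)
    # Group the true labels by the cluster each object was assigned to.
    clusters = {}
    for label, cluster_id in zip(labels, assignments):
        clusters.setdefault(cluster_id, []).append(label)

    total = 0.0
    for members in clusters.values():
        size = len(members)
        counts = Counter(members)
        # Entropy of the class distribution inside this cluster (base 2).
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        # Weight each cluster by its share of all objects.
        total += (size / n) * h
    return total


labels = ["a", "a", "b", "b"]
pure = cluster_entropy(labels, [0, 0, 1, 1])   # each cluster holds one class
mixed = cluster_entropy(labels, [0, 1, 0, 1])  # each cluster is a 50/50 mix
```

Under this formulation a clustering that separates the classes perfectly scores 0, and a lower score indicates better agreement with the class labels.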
Cite this article
Xiong, H., Steinbach, M., Ruslim, A. et al. Characterizing pattern preserving clustering. Knowl Inf Syst 19, 311–336 (2009). https://doi.org/10.1007/s10115-008-0148-0
DOI: https://doi.org/10.1007/s10115-008-0148-0