Skip to main content
Log in

Hierarchical document clustering using local patterns

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

The global pattern mining step in existing pattern-based hierarchical clustering algorithms may result in an unpredictable number of patterns. In this paper, we propose IDHC, a pattern-based hierarchical clustering algorithm that builds a cluster hierarchy without mining for globally significant patterns. IDHC first discovers locally promising patterns by allowing each instance to “vote” for its representative size-2 patterns in a way that ensures an effective balance between local pattern frequency and pattern significance in the dataset. The cluster hierarchy (i.e., the global model) is then directly constructed using these locally promising patterns as features. Each pattern forms an initial (possibly overlapping) cluster, and the rest of the cluster hierarchy is obtained by following a unique iterative cluster refinement process. By effectively utilizing instance-to-cluster relationships, this process directly identifies clusters for each level in the hierarchy, and efficiently prunes duplicate clusters. Furthermore, IDHC produces cluster labels that are more descriptive (patterns are not artificially restricted), and adapts a soft clustering scheme that allows instances to exist in suitable nodes at various levels in the cluster hierarchy. We present results of experiments performed on 16 standard text datasets, and show that IDHC outperforms state-of-the-art hierarchical clustering algorithms in terms of average entropy and FScore measures.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  • Angiulli F, Ianni G, Palopoli L (2001) On the complexity of mining association rules. In: Proceedings of Nono Convegno Nazionale su Sistemi Evoluti di Basi di Dati (SEBD), pp 177–184

  • Beil F, Ester M, Xu X (2002) Frequent term-based text clustering. In: Proceedings of international conference on knowledge discovery and data mining, pp 436–442

  • Brijs T, Vanhoof K, Wets G (2003) Defining interestingness for association rules. Int J Inf Theor Appl 10(4): 370–376

    Google Scholar 

  • Carter C, Hamilton H, Cercone N (1997) Share based measures for itemsets. In: Proceedings of the first European symposium on principles of data mining and knowledge discovery, pp 14–24

  • Clifton C, Coolie R, Rennie J (2004) TopCat: Data mining for topic identification in a text corpus. IEEE Trans Knowl Data Eng 16(8): 949–964

    Article  Google Scholar 

  • Fung B, Wang K, Ester M (2003) Hierarchical document clustering using frequent itemsets. In: Proceedings of SIAM international conference on data mining, pp 59–70

  • Geng L, Hamilton HJ (2006) Interestingness measures for data mining: a survey. ACM Comput Surv 38(3). http://portal.acm.org/citation.cfm?id=1132960.1132963

  • Gunopulos D, Khardon R, Mannila H, Saluja S, Toivonen H, Sharma RS (2003) Discovering all most specific sentences. ACM Trans Database Syst 28(2): 140–174

    Article  Google Scholar 

  • Han E.H., Karypis G, Kumar V, Mobasher B (1997) Clustering based on association rule hypergraphs. In: Proceedings of research issues on data mining and knowledge discovery, pp 59–70

  • Karypis G (2003) CLUTO: A software package for clustering high dimensional datasets. http://www-users.cs.umn.edu/~karypis/cluto/

  • Knobbe A, Crémilleux B, Fürnkranz J, Scholz M (2008) From local patterns to global models: the LeGo approach to data mining. In: Proceedings of local patterns to global models workshop (ECML/PKDD), pp 1–16

  • Li Y, Chung S. M, (2005) Text document clustering based on frequent word sequences. In: Proceedings of the 14th ACM international conference on information and knowledge management (CIKM), pp 293–294

  • Malik H, Kender JR (2006) High quality, efficient hierarchical document clustering using closed interesting itemsets. In: Proceedings of sixth IEEE international conference on data mining, pp 991–996

  • Malik H, Kender JR (2007) Optimizing frequency queries for data mining applications. In: Proceedings of Seventh IEEE International Conference on Data Mining, pp 595–600

  • Malik H, Kender JR (2008) Instance driven hierarchical clustering of document collections. In: Proceedings of local patterns to global models workshop (ECML/PKDD)

  • Moerchen F, Brinker K, Neubauer C (2007) Any-time clustering of high frequency news streams. In: Proceedings of data mining case studies workshop, the thirteenth ACM SIGKDD international conference on knowledge discovery and data mining

  • Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newslett 6(1): 90–105

    Article  Google Scholar 

  • Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manag 24((5): 513–523

    Article  Google Scholar 

  • Tan P, Kumar V, Sristava J (2002) Selecting the right interestingness measure for association patterns. In: Proceedings of 8th international conference on knowledge discovery and data mining, pp 32–41

  • Wang J, Karypis G (2004) SUMMARY: efficient summarizing transactions for clustering. In: Proceedings of fourth IEEE international conference on data mining, pp 241–248

  • Wu C (2006) Mining top-K frequent closed itemsets is not in APX. In: Proceedings of PAKDD, pp 435–439

  • Xiong H, Steinbach M, Tan PN, Kumar V (2004) HICAP: hierarchical clustering with pattern preservation. In: Proceedings of SIAM international conference on data mining, pp 279–290

  • Yang Y, Pedersen JO (1997) A comparative study on feature selection in text categorization. In: Proceedings of the 14th international conference on machine learning, pp 412–420

  • Yu H, Searsmith D, Li X, Han J (2004) Scalable construction of topic directory with nonparametric closed termset mining. In: Proceedings of fourth IEEE international conference on data mining, pp 563–566

  • Zhao Y, Karypis G (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov 10(2): 141–168

    Article  MathSciNet  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Hassan H. Malik.

Additional information

Responsible editor: Johannes Fürnkranz and Arno Knobbe.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Malik, H.H., Kender, J.R., Fradkin, D. et al. Hierarchical document clustering using local patterns. Data Min Knowl Disc 21, 153–185 (2010). https://doi.org/10.1007/s10618-010-0172-z

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-010-0172-z

Keywords

Navigation