High-dimensional clustering: a clique-based hypergraph partitioning framework

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Hypergraph partitioning is considered a promising approach to the challenges of high-dimensional clustering. With objects modeled as vertices and the relationships among objects captured by hyperedges, the goal of hypergraph partitioning is to minimize the hyperedge cut. The definition of hyperedges is therefore vital to clustering performance. Although several definitions of hyperedges have been proposed, a systematic understanding of the desired characteristics of hyperedges is still missing. To that end, in this paper, we first provide a unified clique perspective on the definition of hyperedges, which serves as a guide for defining them. With this perspective, and based on the concepts of shared (reverse) nearest neighbors, we propose two new types of clique hyperedges and analyze their properties with respect to purity and size. Finally, we present an extensive evaluation on real-world document datasets. The experimental results show that shared (reverse) nearest neighbor-based hyperedges significantly improve clustering performance in terms of various external validation measures, without the need for fine-tuning of parameters.
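To make the partitioning objective concrete, the following is a minimal sketch of how the weighted hyperedge cut of a given partition could be computed. It is an illustration only, not the paper's implementation: the function name `hyperedge_cut`, the toy hyperedges, and their weights are all hypothetical, and a hyperedge is assumed to count as cut whenever its vertices fall into more than one part.

```python
from typing import Dict, Hashable, List, Set


def hyperedge_cut(
    hyperedges: List[Set[Hashable]],
    weights: List[float],
    assignment: Dict[Hashable, int],
) -> float:
    """Sum the weights of hyperedges whose vertices span more than one part.

    Assumption (not taken from the paper): a hyperedge contributes its full
    weight once as soon as it touches two or more parts.
    """
    cut = 0.0
    for edge, weight in zip(hyperedges, weights):
        # Collect the distinct parts touched by this hyperedge.
        parts = {assignment[v] for v in edge}
        if len(parts) > 1:
            cut += weight
    return cut


# Hypothetical toy hypergraph: six objects, three weighted hyperedges.
hyperedges = [{"a", "b", "c"}, {"c", "d"}, {"d", "e", "f"}]
weights = [1.0, 0.5, 1.0]

# A partition that keeps the two large hyperedges intact...
good = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
# ...and one that splits every hyperedge.
bad = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}

print(hyperedge_cut(hyperedges, weights, good))
print(hyperedge_cut(hyperedges, weights, bad))
```

Under this sketch, a partition that severs only the light hyperedge scores lower than one that severs all three, which is why the way hyperedges are defined and weighted directly shapes the clusters a partitioner returns.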

Acknowledgments

We would like to thank the editor and reviewers for their valuable comments. This work was supported by NSFC (61100136, 61272067, 70890082, 71028002), GDNSF (S2012030006242), and NSF (CCF-1018151).

Author information

Corresponding author

Correspondence to Chuanren Liu.

About this article

Cite this article

Hu, T., Liu, C., Tang, Y. et al. High-dimensional clustering: a clique-based hypergraph partitioning framework. Knowl Inf Syst 39, 61–88 (2014). https://doi.org/10.1007/s10115-012-0609-3

  • DOI: https://doi.org/10.1007/s10115-012-0609-3
