Abstract
Multi-clustering, which seeks multiple independent ways to partition a data set into groups, has found many applications, such as customer relationship management, bioinformatics, and healthcare informatics. This paper addresses two fundamental questions in multi-clustering: how to model the quality of clusterings and how to find multiple stable clusterings (MSC). We introduce to multi-clustering the notion of clustering stability based on the Laplacian eigengap, which was originally used by the regularized spectral learning method for similarity matrix learning, and we mathematically prove that the larger the eigengap, the more stable the clustering. Furthermore, we propose a novel multi-clustering method, MSC. One advantage of our method over state-of-the-art multi-clustering methods is that it provides users with a feature subspace for understanding each clustering solution. Another advantage is that MSC does not require users to specify the number of clusters or the number of alternative clusterings, which is usually difficult to do without guidance; instead, our method heuristically estimates the number of stable clusterings in a data set. We also discuss a practical way to make MSC applicable to large-scale data, and we report an extensive empirical study that clearly demonstrates the effectiveness of our method.
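The eigengap notion of stability can be illustrated with a minimal sketch (this is not the authors' MSC implementation, and the `eigengap` helper below is a hypothetical name): given a similarity matrix, build the normalized graph Laplacian, and measure the gap between its k-th and (k+1)-th smallest eigenvalues. A larger gap indicates a more clearly separated, and hence more stable, k-way cluster structure.

```python
import numpy as np

def eigengap(S, k):
    """Eigengap lambda_{k+1} - lambda_k of the normalized Laplacian
    L = I - D^{-1/2} S D^{-1/2} built from similarity matrix S.
    A larger gap suggests a more stable k-way clustering."""
    d = S.sum(axis=1)                       # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(S)) - D_inv_sqrt @ S @ D_inv_sqrt
    vals = np.sort(np.linalg.eigvalsh(L))   # eigenvalues in ascending order
    return vals[k] - vals[k - 1]            # 0-indexed: lambda_{k+1} - lambda_k

# Toy similarity matrix with two well-separated blocks of two points each.
S = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])

print(eigengap(S, 2) > eigengap(S, 3))  # True: the 2-way structure is clearer
```

On this toy matrix the gap at k = 2 is large while the gap at k = 3 is essentially zero, matching the intuition that the data contain two stable clusters rather than three.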
Additional information
Hu and Pei’s research was supported in part by an NSERC Discovery Grant and the NSERC CRC Program. Qian and Jin were supported in part by NSF (IIS-1251031) and ONR (N000141410631) during the paper writing. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.
Cite this article
Hu, J., Qian, Q., Pei, J. et al. Finding multiple stable clusterings. Knowl Inf Syst 51, 991–1021 (2017). https://doi.org/10.1007/s10115-016-0998-9