
Learning multiple nonredundant clusterings

Published: 22 October 2010

Abstract

Real-world applications often involve complex data that can be interpreted in many different ways. When clustering such data, multiple groupings may exist that are reasonable and interesting from different perspectives. This is especially true for high-dimensional data, where different feature subspaces may reveal different structures. Traditional clustering, however, is restricted to finding a single clustering of the data. In this article, we propose a new clustering paradigm for exploratory data analysis: find all nonredundant clustering solutions of the data, where data points that share a cluster in one solution may belong to different clusters in other solutions. We present a framework for this problem and suggest two approaches within it: (1) orthogonal clustering, and (2) clustering in orthogonal subspaces. In essence, both approaches find alternative ways to partition the data by projecting it onto a space orthogonal to the current solution; the first seeks orthogonality in the cluster space, while the second seeks orthogonality in the feature space. We study the relationship between the two approaches. We also combine our framework with techniques for automatically determining the number of clusters in each solution, and study stopping criteria for deciding when all meaningful solutions have been discovered. Experiments on both synthetic and high-dimensional benchmark data sets show that our approaches discover varied clustering solutions that are interesting and meaningful.
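The projection idea behind the first approach, orthogonal clustering, can be sketched in a few lines: after each clustering run, every point is projected onto the subspace orthogonal to its own cluster centroid, so the next run is forced to explain structure the current solution cannot. The sketch below is illustrative only, assuming a plain k-means subroutine; the function names (`kmeans`, `orthogonal_clustering`) and all parameter choices are hypothetical, not taken from the article.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # Plain k-means: random initial centers, then alternate assign/update.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        # Recompute each center as the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def orthogonal_clustering(X, k, n_views=2):
    # Hypothetical sketch of the orthogonal-clustering idea: cluster, then
    # project each point onto the complement of its own centroid's direction
    # before clustering again, so successive solutions are nonredundant.
    X = np.asarray(X, dtype=float).copy()
    views = []
    for _ in range(n_views):
        labels, centers = kmeans(X, k)
        views.append(labels)
        for j in range(k):
            mu, mask = centers[j], labels == j
            denom = mu @ mu
            if denom > 1e-12:
                # Remove the component of each member point along mu.
                X[mask] -= np.outer(X[mask] @ mu, mu) / denom
    return views
```

Each pass removes the variance already explained by the current centroids, which is the sense in which the next partition is sought in a space orthogonal to the current solution.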



Published In

ACM Transactions on Knowledge Discovery from Data, Volume 4, Issue 3
October 2010, 191 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/1839490

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 October 2010
Accepted: 01 April 2010
Revised: 01 August 2009
Received: 01 November 2008
Published in TKDD Volume 4, Issue 3


Author Tags

  1. nonredundant clustering
  2. disparate clustering
  3. diverse clustering
  4. orthogonalization

Qualifiers

  • Research-article
  • Research
  • Refereed

