DOI:10.1145/1081870.1081932
Article

Model-based overlapping clustering

Published: 21 August 2005

Abstract

While the vast majority of clustering algorithms are partitional, many real-world datasets have inherently overlapping clusters. Several approaches to finding overlapping clusters have come from work on the analysis of biological datasets. In this paper, we interpret an overlapping clustering model proposed by Segal et al. [23] as a generalization of Gaussian mixture models, and we extend it to an overlapping clustering model based on mixtures of any regular exponential family distribution and the corresponding Bregman divergence. We provide the necessary algorithm modifications for this extension, and present results on synthetic data as well as on subsets of the 20-Newsgroups and EachMovie datasets.
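The abstract relies on the link between exponential family distributions and Bregman divergences. As a minimal sketch of that link only (not the overlapping clustering algorithm of the paper), the general definition d_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)> can be evaluated for two standard convex functions: the squared norm, which gives the squared Euclidean distance associated with the Gaussian case, and the negative entropy, which gives the I-divergence (generalized KL) associated with the Poisson case. The function names and sample vectors below are illustrative assumptions, not taken from the paper.

```python
import math


def bregman_divergence(phi, grad_phi, x, y):
    """Return d_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)>."""
    inner = sum((xi - yi) * gi for xi, yi, gi in zip(x, y, grad_phi(y)))
    return phi(x) - phi(y) - inner


# phi(v) = ||v||^2 yields the squared Euclidean distance.
def sq_norm(v):
    return sum(vi * vi for vi in v)


def sq_norm_grad(v):
    return [2.0 * vi for vi in v]


# phi(v) = sum_i v_i log v_i yields the I-divergence (requires v_i > 0).
def neg_entropy(v):
    return sum(vi * math.log(vi) for vi in v)


def neg_entropy_grad(v):
    return [math.log(vi) + 1.0 for vi in v]


# Hypothetical sample points, chosen only for illustration.
x = [1.0, 2.0, 3.0]
y = [2.0, 2.0, 1.0]

d_euclid = bregman_divergence(sq_norm, sq_norm_grad, x, y)
d_idiv = bregman_divergence(neg_entropy, neg_entropy_grad, x, y)

# Closed forms to compare against.
closed_euclid = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
closed_idiv = sum(xi * math.log(xi / yi) - xi + yi for xi, yi in zip(x, y))
```

Swapping the convex function changes the divergence without changing the surrounding code, which is the sense in which a Gaussian-based model can be generalized to other exponential family members.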

References

[1]
A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. In KDD, 2004.
[2]
A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. In SDM, 2004.
[3]
S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In KDD, 2004.
[4]
A. Battle, E. Segal, and D. Koller. Probabilistic discovery of overlapping cellular processes and their regulation using gene expression data. In RECOMB, 2004.
[5]
J. C. Bezdek and S. K. Pal. Fuzzy Models for Pattern Recognition. IEEE Press, 1992.
[6]
A. Bjorck. Numerical Methods for Least Squares Problems. Society for Industrial & Applied Math (SIAM), 1996.
[7]
Y. Censor and S. Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, 1998.
[8]
Y. Cheng and G. M. Church. Biclustering of expression data. In ISMB, 2000.
[9]
V. Chvátal. Hard knapsack problems. Operations Research, 28(6):1402--1412, 1980.
[10]
M. Collins, S. Dasgupta, and R. Schapire. A generalization of principal component analysis to the exponential family. In NIPS, 2001.
[11]
M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. In COLT, 2000.
[12]
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1--38, 1977.
[13]
I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42:143--175, 2001.
[14]
N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In IJCAI, 1999.
[15]
G. Gordon. Generalized² linear² models. In NIPS, 2001.
[16]
T. Hastie, R. Tibshirani, M. B. Eisen, A. Alizadeh, R. Levy, L. Staudt, W. C. Chan, D. Botstein, and P. Brown. Gene shaving as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology, 2000.
[17]
J. Kleinberg, C. H. Papadimitriou, and P. Raghavan. On the value of private information. In Proc. 8th Conf. on Theoretical Aspects of Rationality and Knowledge, 2001.
[18]
L. Lazzeroni and A. B. Owen. Plaid models for gene expression data. Statistica Sinica, 12(1):61--86, 2002.
[19]
D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2001.
[20]
W. T. McCormick, P. J. Schweitzer, and T. W. White. Problem decomposition and data reorganization by a clustering technique. Operations Research, 20:993--1009, 1972.
[21]
P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall/CRC, 1989.
[22]
M. Sahami, M. Hearst, and E. Saund. Applying the Multiple Cause Mixture Model to Text Categorization. In ICML, 1996.
[23]
E. Segal, A. Battle, and D. Koller. Decomposing gene expression into cellular processes. In PSB, 2003.

Published In

KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
August 2005
844 pages
ISBN:159593135X
DOI:10.1145/1081870
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Bregman divergences
  2. exponential model
  3. graphical model
  4. high-dimensional clustering
  5. overlapping clustering

Qualifiers

  • Article

Conference

KDD05

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Cited By

  • (2024) Well-connectedness and community detection. PLOS Complex Systems 1:3 (e0000009). DOI:10.1371/journal.pcsy.0000009. Online publication date: 5-Nov-2024
  • (2024) Mixtures of Probit Regression Models with Overlapping Clusters. Bayesian Analysis 19:3. DOI:10.1214/23-BA1372. Online publication date: 1-Sep-2024
  • (2024) Clustering and Topic Discovery of Multiway Data via Joint-NCMTF. 2024 IEEE International Conference on Big Data (BigData) (1268-1275). DOI:10.1109/BigData62323.2024.10825741. Online publication date: 15-Dec-2024
  • (2023) Impact of E-Learning Orientation, Moodle Usage, and Learning Planning on Learning Outcomes in On-Demand Lectures. Education Sciences 13:10 (1005). DOI:10.3390/educsci13101005. Online publication date: 3-Oct-2023
  • (2023) Distributed Representation for Assembly Code. Computers 12:11 (222). DOI:10.3390/computers12110222. Online publication date: 1-Nov-2023
  • (2023) A Cross-Sampling Method for Hidden Structure Extraction to Improve Imbalanced Multiclass Classification Accuracy. 2023 Sixth International Conference on Vocational Education and Electrical Engineering (ICVEE) (353-358). DOI:10.1109/ICVEE59738.2023.10348228. Online publication date: 14-Oct-2023
  • (2022) Grid-Based Clustering Using Boundary Detection. Entropy 24:11 (1606). DOI:10.3390/e24111606. Online publication date: 4-Nov-2022
  • (2022) AOC: Assembling overlapping communities. Quantitative Science Studies 3:4 (1079-1096). DOI:10.1162/qss_a_00227. Online publication date: 20-Dec-2022
  • (2022) Federated Learning With Soft Clustering. IEEE Internet of Things Journal 9:10 (7773-7782). DOI:10.1109/JIOT.2021.3113927. Online publication date: 15-May-2022
  • (2022) Legal Entity Disambiguation for Financial Crime Detection. 2022 IEEE International Conference on Big Data (Big Data) (6639-6641). DOI:10.1109/BigData55660.2022.10020700. Online publication date: 17-Dec-2022
