Model-based overlapping clustering

Authors: Arindam Banerjee, Chase Krumpelman, Raymond J. Mooney

KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining

Pages 532 - 537

https://doi.org/10.1145/1081870.1081932

Published: 21 August 2005

Abstract

While the vast majority of clustering algorithms are partitional, many real-world datasets have inherently overlapping clusters. Several approaches to finding overlapping clusters have come from work on the analysis of biological datasets. In this paper, we interpret an overlapping clustering model proposed by Segal et al. [23] as a generalization of Gaussian mixture models, and we extend it to an overlapping clustering model based on mixtures of any regular exponential family distribution and the corresponding Bregman divergence. We provide the necessary algorithm modifications for this extension, and present results on synthetic data as well as on subsets of the 20-Newsgroups and EachMovie datasets.
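To make the abstract's pairing of exponential families with Bregman divergences concrete, the following is a minimal sketch and not the paper's algorithm. It uses the standard definition d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>. The function names and sample vectors are illustrative assumptions. Choosing phi(x) = ||x||^2 gives squared Euclidean distance, the divergence associated with Gaussian models, while phi(x) = sum_i x_i log x_i gives the generalized I-divergence associated with Poisson models.

```python
import math

def bregman(phi, grad_phi, x, y):
    """Bregman divergence d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner

# phi(x) = ||x||^2 induces squared Euclidean distance (Gaussian case).
def sq_norm(x):
    return sum(v * v for v in x)

def sq_norm_grad(x):
    return [2.0 * v for v in x]

# phi(x) = sum_i x_i log x_i induces the generalized I-divergence (Poisson case).
# Requires strictly positive entries.
def neg_entropy(x):
    return sum(v * math.log(v) for v in x)

def neg_entropy_grad(x):
    return [math.log(v) + 1.0 for v in x]

# Illustrative vectors (assumed, not taken from the paper).
x = [1.0, 2.0]
y = [2.0, 4.0]

d_euclid = bregman(sq_norm, sq_norm_grad, x, y)
d_idiv = bregman(neg_entropy, neg_entropy_grad, x, y)

# Closed forms for comparison.
sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
i_div = sum(a * math.log(a / b) - a + b for a, b in zip(x, y))
```

Here d_euclid matches sq_dist and d_idiv matches i_div, which shows that swapping the convex function phi swaps the distortion measure while the rest of a clustering procedure can stay the same.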

References

[1] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. In KDD, 2004.

[2] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. In SDM, 2004.

[3] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In KDD, 2004.

[4] A. Battle, E. Segal, and D. Koller. Probabilistic discovery of overlapping cellular processes and their regulation using gene expression data. In RECOMB, 2004.

[5] J. C. Bezdek and S. K. Pal. Fuzzy Models for Pattern Recognition. IEEE Press, 1992.

[6] A. Bjorck. Numerical Methods for Least Squares Problems. Society for Industrial & Applied Math (SIAM), 1996.

[7] Y. Censor and S. Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, 1998.

[8] Y. Cheng and G. M. Church. Biclustering of expression data. In ISMB, 2000.

[9] V. Chvátal. Hard knapsack problems. Operations Research, 28(6):1402--1412, 1980.

[10] M. Collins, S. Dasgupta, and R. Schapire. A generalization of principal component analysis to the exponential family. In NIPS, 2001.

[11] M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. In COLT, 2000.

[12] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1--38, 1977.

[13] I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42:143--175, 2001.

[14] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In IJCAI, 1999.

[15] G. Gordon. Generalized² linear² models. In NIPS, 2001.

[16] T. Hastie, R. Tibshirani, M. B. Eisen, A. Alizadeh, R. Levy, L. Staudt, W. C. Chan, D. Botstein, and P. Brown. Gene shaving as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology, 2000.

[17] J. Kleinberg, C. H. Papadimitriou, and P. Raghavan. On the value of private information. In Proc. 8th Conf. on Theoretical Aspects of Rationality and Knowledge, 2001.

[18] L. Lazzeroni and A. B. Owen. Plaid models for gene expression data. Statistica Sinica, 12(1):61--86, 2002.

[19] D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2001.

[20] W. T. McCormick, P. J. Schweitzer, and T. W. White. Problem decomposition and data reorganization by a clustering technique. Operations Research, 20:993--1009, 1972.

[21] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall/CRC, 1989.

[22] M. Sahami, M. Hearst, and E. Saund. Applying the Multiple Cause Mixture Model to Text Categorization. In ICML, 1996.

[23] E. Segal, A. Battle, and D. Koller. Decomposing gene expression into cellular processes. In PSB, 2003.

Index Terms

Model-based overlapping clustering
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information systems applications
    1. Data mining

Published In

KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining

August 2005

844 pages

ISBN:159593135X

DOI:10.1145/1081870

General Chair: Robert Grossman (University of Illinois at Chicago & Open Data Partners, USA)

Program Chairs: Roberto Bayardo (IBM Almaden Research, USA), Kristin Bennett (RPI, USA)

Copyright © 2005 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

KDD05: The Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 21 - 24, 2005

Chicago, Illinois, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics

Total Citations: 130
Total Downloads: 1,489
Downloads (Last 12 months): 62
Downloads (Last 6 weeks): 8

Reflects downloads up to 05 Mar 2025

Cited By

Park M, Tabatabaee Y, Ramavarapu V, Liu B, Pailodi V, Ramachandran R, Korobskiy D, Ayres F, Chacko G, Warnow T (2024). Well-connectedness and community detection. PLOS Complex Systems, 1:3 (e0000009). Online publication date: 5-Nov-2024. https://doi.org/10.1371/journal.pcsy.0000009

Ranciati S, Vinciotti V, Wit E, Galimberti G (2024). Mixtures of Probit Regression Models with Overlapping Clusters. Bayesian Analysis, 19:3. Online publication date: 1-Sep-2024. https://doi.org/10.1214/23-BA1372

Cobb B, Velasquez R, Vuduc R, Park H (2024). Clustering and Topic Discovery of Multiway Data via Joint-NCMTF. 2024 IEEE International Conference on Big Data (BigData), (1268-1275). Online publication date: 15-Dec-2024. https://doi.org/10.1109/BigData62323.2024.10825741

Aida S (2023). Impact of E-Learning Orientation, Moodle Usage, and Learning Planning on Learning Outcomes in On-Demand Lectures. Education Sciences, 13:10 (1005). Online publication date: 3-Oct-2023. https://doi.org/10.3390/educsci13101005

Yoshida K, Suzuki K, Matsuzawa T (2023). Distributed Representation for Assembly Code. Computers, 12:11 (222). Online publication date: 1-Nov-2023. https://doi.org/10.3390/computers12110222

Yustanti W, Iriawan N, Irhamah, Nuryana I, Indriyanti A (2023). A Cross-Sampling Method for Hidden Structure Extraction to Improve Imbalanced Multiclass Classification Accuracy. 2023 Sixth International Conference on Vocational Education and Electrical Engineering (ICVEE), (353-358). Online publication date: 14-Oct-2023. https://doi.org/10.1109/ICVEE59738.2023.10348228

Du M, Wu F (2022). Grid-Based Clustering Using Boundary Detection. Entropy, 24:11 (1606). Online publication date: 4-Nov-2022. https://doi.org/10.3390/e24111606

Jakatdar A, Liu B, Warnow T, Chacko G (2022). AOC: Assembling overlapping communities. Quantitative Science Studies, 3:4 (1079-1096). Online publication date: 20-Dec-2022. https://doi.org/10.1162/qss_a_00227

Li C, Li G, Varshney P (2022). Federated Learning With Soft Clustering. IEEE Internet of Things Journal, 9:10 (7773-7782). Online publication date: 15-May-2022. https://doi.org/10.1109/JIOT.2021.3113927

Fior J, Favale T, Cagliero L, Giordano D, Mellia M, Baralis E, Ronchiadin S, Baracco P, Moncalvo D (2022). Legal Entity Disambiguation for Financial Crime Detection. 2022 IEEE International Conference on Big Data (Big Data), (6639-6641). Online publication date: 17-Dec-2022. https://doi.org/10.1109/BigData55660.2022.10020700
