research-article

Model-based document clustering with a collapsed Gibbs sampler

Authors:
Daniel David Walker, Brigham Young University, Provo, UT, USA
Eric K. Ringger, Brigham Young University, Provo, UT, USA

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2008, Pages 704–712
https://doi.org/10.1145/1401890.1401975

Published: 24 August 2008

ABSTRACT

Model-based algorithms are emerging as a preferred method for document clustering. As computing resources improve, methods such as Gibbs sampling have become more common for parameter estimation in these models. Gibbs sampling is well understood for many applications, but has not been extensively studied for use in document clustering. We explore the convergence rate, the possibility of label switching, and chain summarization methodologies for document clustering on a particular model, namely a mixture of multinomials model, and show that fairly simple methods can be employed, while still producing clusterings of superior quality compared to those produced with the EM algorithm.
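
To make the setting concrete, here is a minimal sketch of a collapsed Gibbs sampler for a mixture of multinomials, in which the mixing proportions and the per-cluster word distributions are integrated out under symmetric Dirichlet priors and only the document labels are sampled. It is an illustration under stated assumptions, not the authors' implementation: the function name `collapsed_gibbs`, the hyperparameters `alpha` and `beta`, the toy corpus, and the choice to return the final sample as the chain summary are all hypothetical.

```python
import math
import random
from collections import Counter


def collapsed_gibbs(docs, K, V, alpha=1.0, beta=0.1, iters=50, seed=0):
    """Sample cluster labels for docs (lists of word ids in range(V)).

    Assumed model: symmetric Dirichlet(alpha) prior on mixing proportions,
    symmetric Dirichlet(beta) prior on each cluster's word distribution.
    Returns the labels from the final sweep (an assumed, simple summary).
    """
    rng = random.Random(seed)
    counts = [Counter(d) for d in docs]       # word counts per document
    lengths = [len(d) for d in docs]          # token count per document
    z = [rng.randrange(K) for _ in docs]      # random initial labels
    n_docs = [0] * K                          # documents per cluster
    n_tokens = [0] * K                        # tokens per cluster
    n_kw = [Counter() for _ in range(K)]      # word counts per cluster
    for d, k in enumerate(z):
        n_docs[k] += 1
        n_tokens[k] += lengths[d]
        n_kw[k].update(counts[d])

    for _ in range(iters):
        for d in range(len(docs)):
            # Remove document d from its current cluster.
            k_old = z[d]
            n_docs[k_old] -= 1
            n_tokens[k_old] -= lengths[d]
            n_kw[k_old].subtract(counts[d])

            # Log of p(z_d = k | all other labels, all words), up to a constant.
            logp = []
            for k in range(K):
                lp = math.log(n_docs[k] + alpha)
                lp += math.lgamma(n_tokens[k] + V * beta)
                lp -= math.lgamma(n_tokens[k] + lengths[d] + V * beta)
                for w, c in counts[d].items():
                    lp += math.lgamma(n_kw[k][w] + c + beta)
                    lp -= math.lgamma(n_kw[k][w] + beta)
                logp.append(lp)

            # Normalize in log space for stability, then draw a new label.
            m = max(logp)
            weights = [math.exp(lp - m) for lp in logp]
            k_new = rng.choices(range(K), weights=weights)[0]

            # Add document d to the sampled cluster.
            z[d] = k_new
            n_docs[k_new] += 1
            n_tokens[k_new] += lengths[d]
            n_kw[k_new].update(counts[d])
    return z


# Toy corpus (hypothetical): four documents over words 0-2, four over words 3-5.
docs = [[0, 1, 2] * 3 for _ in range(4)] + [[3, 4, 5] * 3 for _ in range(4)]
labels = collapsed_gibbs(docs, K=2, V=6, iters=100)
```

Because cluster labels are exchangeable, a result should be judged by the partition it induces rather than by the label values themselves; this is the label switching issue the abstract refers to. Returning only the last sample is one simple way to summarize a chain, assumed here for brevity.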

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Mathematics of computing
  1. Probability and statistics
    1. Probabilistic algorithms
    2. Probabilistic reasoning algorithms
      1. Markov-chain Monte Carlo methods
      2. Sequential Monte Carlo methods

Published in
KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008, 1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li (Microsoft adCenter Labs)
Program Chairs: Bing Liu (University of Illinois at Chicago), Sunita Sarawagi (Indian Institute of Technology, Bombay)
Copyright © 2008 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Author Tags
collapsed samplers
document clustering
EM
Gibbs sampling
MCMC
practical guidelines
Acceptance Rates
KDD '08 paper acceptance rate: 118 of 593 submissions, 20%
Overall acceptance rate: 1,133 of 8,635 submissions, 13%