ABSTRACT
Model-based algorithms are emerging as a preferred method for document clustering. As computing resources improve, methods such as Gibbs sampling have become more common for parameter estimation in these models. Gibbs sampling is well understood for many applications, but it has not been extensively studied in document clustering. We explore the convergence rate, the possibility of label switching, and chain summarization methodologies for document clustering with one particular model, a mixture of multinomials. We show that fairly simple methods suffice and still produce clusterings of higher quality than those produced with the EM algorithm.
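To make the setting concrete, the following is a minimal sketch of a collapsed Gibbs sampler for a mixture of multinomials. It is not the authors' implementation. The abstract does not specify priors or hyperparameters, so the symmetric Dirichlet priors (`alpha` on cluster proportions, `beta` on per-cluster word distributions), the function name `collapsed_gibbs`, and the input format are all assumptions made for illustration. Each document's cluster label is resampled from its conditional distribution given all other labels, with the multinomial parameters integrated out.

```python
import math
import random


def collapsed_gibbs(docs, n_clusters, vocab_size, alpha=1.0, beta=0.1,
                    n_iters=50, seed=0):
    """Sketch of collapsed Gibbs sampling for a mixture of multinomials.

    docs: list of dicts mapping word id -> count (hypothetical input format).
    alpha, beta: assumed symmetric Dirichlet hyperparameters.
    Returns the cluster labels from the final sweep.
    """
    rng = random.Random(seed)
    z = [rng.randrange(n_clusters) for _ in docs]        # random initial labels
    doc_count = [0] * n_clusters                         # documents per cluster
    word_count = [[0] * vocab_size for _ in range(n_clusters)]
    total = [0] * n_clusters                             # tokens per cluster
    lengths = [sum(d.values()) for d in docs]

    def move(d, k, sign):
        # Add (sign=+1) or remove (sign=-1) document d from cluster k's counts.
        doc_count[k] += sign
        total[k] += sign * lengths[d]
        for v, c in docs[d].items():
            word_count[k][v] += sign * c

    for d, k in enumerate(z):
        move(d, k, +1)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            move(d, z[d], -1)                            # hold out document d
            log_p = []
            for k in range(n_clusters):
                # Prior term: cluster size plus alpha.
                lp = math.log(doc_count[k] + alpha)
                # Dirichlet-multinomial predictive term for the whole document.
                lp += math.lgamma(total[k] + vocab_size * beta)
                lp -= math.lgamma(total[k] + lengths[d] + vocab_size * beta)
                for v, c in doc.items():
                    lp += math.lgamma(word_count[k][v] + c + beta)
                    lp -= math.lgamma(word_count[k][v] + beta)
                log_p.append(lp)
            m = max(log_p)                               # stabilize before exp
            weights = [math.exp(lp - m) for lp in log_p]
            z[d] = rng.choices(range(n_clusters), weights=weights)[0]
            move(d, z[d], +1)
    return z


if __name__ == "__main__":
    toy_docs = [{0: 5, 1: 5}, {0: 4, 1: 6}, {2: 5, 3: 5}, {2: 6, 3: 4}]
    print(collapsed_gibbs(toy_docs, n_clusters=2, vocab_size=4))
```

Returning only the final sample is one simple way to summarize the chain, chosen here for brevity. Because cluster labels are exchangeable, the numeric label assigned to a given group of documents is arbitrary and may differ between runs or seeds.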