Research article. DOI: 10.1145/1401890.1401975

Model-based document clustering with a collapsed Gibbs sampler

Published: 24 August 2008

ABSTRACT

Model-based algorithms are emerging as a preferred method for document clustering. As computing resources improve, methods such as Gibbs sampling have become more common for parameter estimation in these models. Gibbs sampling is well understood for many applications, but has not been extensively studied for use in document clustering. We explore the convergence rate, the possibility of label switching, and chain summarization methodologies for document clustering on a particular model, namely a mixture of multinomials model, and show that fairly simple methods can be employed, while still producing clusterings of superior quality compared to those produced with the EM algorithm.
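The sampler the abstract describes collapses (integrates out) the mixture weights and the cluster-specific word distributions under Dirichlet priors, then resamples each document's cluster assignment from its conditional posterior given all other assignments. The following is a minimal illustrative sketch of that idea, not the authors' implementation; the function name, the symmetric priors `alpha` and `beta`, and all parameter defaults are assumptions for the example.

```python
import math
import random

def collapsed_gibbs(docs, V, K, alpha=1.0, beta=0.5, iters=100, seed=0):
    """Collapsed Gibbs sampling for a mixture-of-multinomials model.

    docs  : list of {word_id: count} dicts (bag-of-words documents)
    V, K  : vocabulary size and number of clusters
    alpha, beta : symmetric Dirichlet priors on the mixture weights and
                  the per-cluster word distributions (both collapsed out)
    Returns the final cluster assignment z[i] for each document.
    """
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in docs]        # random initial assignments
    n = [0] * K                                 # documents per cluster
    Nkw = [[0] * V for _ in range(K)]           # word counts per cluster
    Nk = [0] * K                                # total tokens per cluster
    length = [sum(d.values()) for d in docs]
    for i, d in enumerate(docs):                # accumulate initial counts
        k = z[i]
        n[k] += 1
        Nk[k] += length[i]
        for w, c in d.items():
            Nkw[k][w] += c

    for _ in range(iters):
        for i, d in enumerate(docs):
            # Remove document i's counts from its current cluster.
            k = z[i]
            n[k] -= 1
            Nk[k] -= length[i]
            for w, c in d.items():
                Nkw[k][w] -= c
            # Log conditional posterior over clusters, parameters collapsed:
            # p(z_i = j | rest) ∝ (n_j + alpha) * Dirichlet-multinomial
            # predictive probability of document i under cluster j.
            logp = []
            for j in range(K):
                lp = math.log(n[j] + alpha)
                lp -= (math.lgamma(Nk[j] + length[i] + V * beta)
                       - math.lgamma(Nk[j] + V * beta))
                for w, c in d.items():
                    lp += (math.lgamma(Nkw[j][w] + c + beta)
                           - math.lgamma(Nkw[j][w] + beta))
                logp.append(lp)
            # Sample a new cluster from the normalized posterior.
            m = max(logp)
            p = [math.exp(l - m) for l in logp]
            r = rng.random() * sum(p)
            k = 0
            while r > p[k]:
                r -= p[k]
                k += 1
            z[i] = k
            n[k] += 1
            Nk[k] += length[i]
            for w, c in d.items():
                Nkw[k][w] += c
    return z
```

On well-separated synthetic data this sketch recovers the clusters; the concerns the abstract studies on top of the basic sampler (convergence rate, label switching, chain summarization) are deliberately omitted here, since the sketch returns only the final state of a single chain.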


Published in

KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2008, 1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li. Program Chairs: Bing Liu, Sunita Sarawagi.

            Copyright © 2008 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States



Acceptance Rates

KDD '08 paper acceptance rate: 118 of 593 submissions (20%). Overall acceptance rate: 1,133 of 8,635 submissions (13%).
