DOI:10.1145/2542050.2542079
research-article

Document clustering using dirichlet process mixture model of von Mises-Fisher distributions

Published: 05 December 2013

ABSTRACT

Document clustering has become an increasingly important technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. This paper proposes a Dirichlet process mixture (DPM) model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. We develop a mean-field variational inference algorithm for the DPM model of vMFs and apply it to clustering text documents. With this model, the number of clusters is determined automatically by the clustering process rather than fixed in advance. We conducted extensive experiments to evaluate the proposed approach on a large number of high-dimensional text datasets. Experimental results on the NMI (Normalized Mutual Information) and Purity evaluation measures show that our approach outperforms four state-of-the-art clustering algorithms.
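
For readers unfamiliar with the two evaluation measures named in the abstract, the following is a minimal sketch of how Purity and NMI can be computed from true class labels and predicted cluster labels. It is not taken from the paper: the function names are illustrative, and the NMI normalization used here (the geometric mean of the two entropies) is an assumption, since the abstract does not state which variant the authors use.

```python
import math
from collections import Counter


def purity(labels_true, labels_pred):
    """Fraction of items that belong to the majority class of their cluster."""
    per_cluster = {}
    for t, p in zip(labels_true, labels_pred):
        per_cluster.setdefault(p, Counter())[t] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(labels_true)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Mutual information divided by sqrt(H(true) * H(pred)).

    Assumption: geometric-mean normalization; other variants divide by
    the arithmetic mean or the maximum of the two entropies.
    """
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_t = Counter(labels_true)
    count_p = Counter(labels_pred)
    mi = sum(
        (nij / n) * math.log(nij * n / (count_t[t] * count_p[p]))
        for (t, p), nij in joint.items()
    )
    h_t, h_p = entropy(labels_true), entropy(labels_pred)
    if h_t == 0 or h_p == 0:
        # Degenerate case: at least one labeling has a single group.
        return 1.0 if h_t == h_p else 0.0
    return mi / math.sqrt(h_t * h_p)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    clusters = [1, 1, 0, 0]  # same partition, different cluster ids
    print(purity(truth, clusters), nmi(truth, clusters))
```

Both measures depend only on the partition, not on the cluster ids, so relabeling the clusters leaves the scores unchanged. Purity alone rewards over-splitting (one cluster per document scores 1.0), which is one reason it is usually reported together with NMI.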


    • Published in

      SoICT '13: Proceedings of the 4th Symposium on Information and Communication Technology
      December 2013
      345 pages
      ISBN:9781450324540
      DOI:10.1145/2542050

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      SoICT '13 Paper Acceptance Rate: 40 of 80 submissions, 50%. Overall Acceptance Rate: 147 of 318 submissions, 46%.
