ABSTRACT
Document clustering has become an increasingly important technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. This paper proposes a Dirichlet process mixture (DPM) model for clustering directional data, based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. We develop a mean-field variational inference algorithm for the DPM model of vMF distributions and apply it to clustering text documents. With this model, the number of clusters is determined automatically after the clustering process rather than estimated in advance. We conducted extensive experiments to evaluate the proposed approach on a large number of high-dimensional text datasets. Experimental results on the NMI (Normalized Mutual Information) and Purity evaluation measures show that our approach outperforms four state-of-the-art clustering algorithms.
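The two evaluation measures named above can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `purity`, `entropy`, and `nmi` are hypothetical, the geometric-mean normalization of NMI is an assumption (the abstract does not say which variant is used), and returning 0 when either labeling has zero entropy is an assumed convention.

```python
from collections import Counter
from math import log, sqrt


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority true class of their cluster."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, Counter())[t] += 1
    majority_total = sum(max(c.values()) for c in clusters.values())
    return majority_total / len(labels_true)


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Mutual information divided by sqrt(H(true) * H(pred)) (assumed normalization)."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    mi = 0.0
    for (t, p), c in joint.items():
        mi += (c / n) * log(c * n / (true_counts[t] * pred_counts[p]))
    h_true, h_pred = entropy(labels_true), entropy(labels_pred)
    if h_true == 0.0 or h_pred == 0.0:
        return 0.0  # assumed convention when one labeling has a single group
    return mi / sqrt(h_true * h_pred)


if __name__ == "__main__":
    truth = ["a", "a", "b", "b"]
    clusters = [0, 0, 1, 1]
    print(purity(truth, clusters), nmi(truth, clusters))
```

Both measures compare cluster assignments against ground-truth class labels, so they apply regardless of how many clusters the model ends up selecting.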
Index Terms
- Document clustering using Dirichlet process mixture model of von Mises-Fisher distributions