Document clustering using dirichlet process mixture model of von Mises-Fisher distributions

Authors:
Nguyen Kim Anh, Hanoi University of Science and Technology
Nguyen The Tam, Hanoi University of Science and Technology
Ngo Van Linh, Hanoi University of Science and Technology

SoICT '13: Proceedings of the 4th Symposium on Information and Communication Technology, December 2013, Pages 131–138
https://doi.org/10.1145/2542050.2542079

Published: 05 December 2013

ABSTRACT

Document clustering has become an increasingly important technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. This paper proposes a Dirichlet process mixture (DPM) model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. We develop a mean-field variational inference algorithm for the DPM model of vMFs and apply it to clustering text documents. With this model, the number of clusters is determined automatically by the clustering process rather than estimated in advance. We conducted extensive experiments to evaluate the proposed approach on a large number of high-dimensional text datasets. Experimental results on the NMI (Normalized Mutual Information) and Purity evaluation measures show that our approach outperforms four state-of-the-art clustering algorithms.
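
The two evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names `purity` and `nmi` are hypothetical, and normalizing mutual information by the geometric mean of the two entropies is an assumption (other normalizations, such as the arithmetic mean, are also in common use and the abstract does not say which one was applied).

```python
import math
from collections import Counter


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    per_cluster = {}
    for t, p in zip(labels_true, labels_pred):
        per_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in per_cluster.values()) / len(labels_true)


def nmi(labels_true, labels_pred):
    """Mutual information divided by sqrt(H(true) * H(pred)) (assumed normalization)."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_t = Counter(labels_true)
    count_p = Counter(labels_pred)
    # I(T; P) = sum over non-empty cells of p(t, p) * log(p(t, p) / (p(t) * p(p)))
    mi = sum(
        (c / n) * math.log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )
    h_t = -sum((c / n) * math.log(c / n) for c in count_t.values())
    h_p = -sum((c / n) * math.log(c / n) for c in count_p.values())
    if h_t == 0.0 or h_p == 0.0:
        # Degenerate case: at least one labeling puts everything in a single group.
        return 1.0 if h_t == h_p else 0.0
    return mi / math.sqrt(h_t * h_p)


if __name__ == "__main__":
    true_labels = [0, 0, 1, 1]
    print(purity(true_labels, [1, 1, 0, 0]), nmi(true_labels, [1, 1, 0, 0]))
    print(purity(true_labels, [0, 1, 0, 1]), nmi(true_labels, [0, 1, 0, 1]))
```

Both measures ignore the cluster identifiers themselves, so a clustering that matches the true classes up to a relabeling scores the maximum on each, while a clustering that is independent of the classes gets an NMI of zero.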

Index Terms

1. Information systems
  1. Information retrieval

Published in
SoICT '13: Proceedings of the 4th Symposium on Information and Communication Technology
December 2013
345 pages
ISBN:9781450324540
DOI:10.1145/2542050
General Chairs: Thang Huynh Quyet (HUST, Vietnam), Binh Nguyen Thanh (DUT, Vietnam)
Program Chairs: Tien Do Van (BME, Hungary), Marc Bui (EPHE, France), Son Ngo Hong (HUST, Vietnam)
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Bayesian nonparametrics
clustering
probabilistic graphical models
variational inference
Qualifiers
- research-article
Acceptance Rates
SoICT '13 paper acceptance rate: 40 of 80 submissions, 50%
Overall acceptance rate: 147 of 318 submissions, 46%
Article Metrics
- Total Citations: 6
- Total Downloads: 181
- Downloads (Last 12 months): 9
- Downloads (Last 6 weeks): 1