ABSTRACT
Some models of textual corpora employ text generation methods involving n-gram statistics, while others use latent topic variables inferred using the "bag-of-words" assumption, in which word order is ignored. Previously, these methods have not been combined. In this work, I explore a hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model. The model hyperparameters are inferred using a Gibbs EM algorithm. On two data sets, each of 150 documents, the new model exhibits better predictive accuracy than either a hierarchical Dirichlet bigram language model or a unigram topic model. Additionally, the inferred topics are less dominated by function words than are topics discovered using unigram statistics, potentially making them more meaningful.
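As a rough illustration of what combining n-gram statistics with latent topic variables can mean, the sketch below generates a toy document in which each word is drawn conditioned on both a per-token topic and the previous word. It is not the paper's implementation: the function names, the symmetric Dirichlet priors (used here in place of the hierarchical prior the abstract mentions), the fixed start word, and all numeric values are assumptions chosen for illustration. Inference, including the Gibbs EM step, is not shown.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(length, theta, phi, vocab_size, rng, start_word=0):
    """Generate word ids where each word depends on its topic and the previous word.

    theta: topic proportions for this document (one entry per topic).
    phi:   phi[k][v] is a distribution over the next word, given topic k
           and previous word v.
    start_word: hypothetical context for the first token (an assumption).
    """
    words, topics = [], []
    prev = start_word
    for _ in range(length):
        # Choose a topic for this token from the document's topic proportions.
        k = rng.choices(range(len(theta)), weights=theta)[0]
        # Choose the word from the distribution indexed by (topic, previous word).
        w = rng.choices(range(vocab_size), weights=phi[k][prev])[0]
        words.append(w)
        topics.append(k)
        prev = w
    return words, topics


# Illustrative sizes and hyperparameters (assumptions, not values from the paper).
rng = random.Random(0)
num_topics, vocab_size = 3, 5
theta = sample_dirichlet([0.5] * num_topics, rng)
phi = [
    [sample_dirichlet([0.5] * vocab_size, rng) for _ in range(vocab_size)]
    for _ in range(num_topics)
]
words, topics = generate_document(10, theta, phi, vocab_size, rng)
```

If every row `phi[k][v]` were the same for all previous words `v`, this would reduce to a unigram topic model; with a single topic it would reduce to a bigram language model.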
Index Terms
- Topic modeling: beyond bag-of-words