DOI: 10.1145/1143844.1143967
Article

Topic modeling: beyond bag-of-words

Published: 25 June 2006

ABSTRACT

Some models of textual corpora employ text generation methods involving n-gram statistics, while others use latent topic variables inferred using the "bag-of-words" assumption, in which word order is ignored. Previously, these methods have not been combined. In this work, I explore a hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model. The model hyperparameters are inferred using a Gibbs EM algorithm. On two data sets, each of 150 documents, the new model exhibits better predictive accuracy than either a hierarchical Dirichlet bigram language model or a unigram topic model. Additionally, the inferred topics are less dominated by function words than are topics discovered using unigram statistics, potentially making them more meaningful.
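
The abstract does not spell out the generative process. As a rough illustration only, the sketch below assumes one common reading of the combination: each word's topic is drawn from a per-document topic mixture, as in a unigram topic model, and the word itself is then drawn from a distribution indexed by both that topic and the previous word, as in a bigram model. The function names, the symmetric Dirichlet priors with fixed `alpha` and `beta`, and the `start_word` convention are assumptions made for this example. The sketch does not perform the Gibbs EM hyperparameter inference mentioned above.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index in range(len(probs)) with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(length, num_topics, vocab_size, alpha, beta, phi, rng,
                      start_word=0):
    """Generate word ids and topic ids for one document.

    phi is a dict shared across documents: phi[(topic, previous_word)] holds
    a distribution over the next word and is drawn lazily on first use.
    """
    # Assumption: symmetric Dirichlet priors with fixed alpha and beta.
    theta = sample_dirichlet([alpha] * num_topics, rng)  # per-document topic mixture
    words, topics = [], []
    previous = start_word  # hypothetical start-of-document word id
    for _ in range(length):
        # Latent topic for this position, as in a unigram topic model.
        topic = sample_index(theta, rng)
        key = (topic, previous)
        if key not in phi:
            phi[key] = sample_dirichlet([beta] * vocab_size, rng)
        # The word depends on both the topic and the previous word.
        word = sample_index(phi[key], rng)
        words.append(word)
        topics.append(topic)
        previous = word
    return words, topics
```

For example, calling `generate_document(20, 3, 5, 1.0, 0.5, {}, random.Random(0))` returns 20 word ids and 20 topic ids. Passing the same `phi` dict to several calls shares the word distributions across documents.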

      • Published in

        ICML '06: Proceedings of the 23rd international conference on Machine learning
        June 2006
        1154 pages
        ISBN: 1595933832
        DOI: 10.1145/1143844

        Copyright © 2006 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • Article

        Acceptance Rates

        ICML '06 Paper Acceptance Rate: 140 of 548 submissions, 26%
        Overall Acceptance Rate: 140 of 548 submissions, 26%
