Learning author-topic models from text corpora

Published: 29 January 2010


We propose an unsupervised learning technique for extracting information about authors and topics from large text collections. We model documents as if they were generated by a two-stage stochastic process. An author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words. The probability distribution over topics in a multi-author paper is a mixture of the distributions associated with the authors. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to three large text corpora: 150,000 abstracts from the CiteSeer digital library, 1,740 papers from the Neural Information Processing Systems (NIPS) Conferences, and 121,000 emails from the Enron corporation. We discuss in detail the interpretation of the results discovered by the system, including specific topic and author models, ranking of authors by topic and topics by author, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. Experiments based on perplexity scores for test documents and precision-recall for document retrieval are used to illustrate systematic differences between the proposed author-topic model and a number of alternatives. Extensions to the model, allowing, for example, generalizations of the notion of an author, are also briefly discussed.
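The generative story in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy parameters, and the choice to pick each word's author uniformly at random from the document's authors (which makes the document-level mixture equal-weighted) are assumptions made for this sketch. Learning the distributions with Markov chain Monte Carlo is not shown.

```python
import random


def generate_document(authors, theta, phi, vocab, n_words, rng):
    """Sample the words of one document under an author-topic style process.

    authors: list of author ids for this document
    theta:   dict mapping author id -> list of topic probabilities
    phi:     list (one entry per topic) of word probabilities over vocab
    """
    n_topics = len(phi)
    words, assignments = [], []
    for _ in range(n_words):
        # Pick the author responsible for this word.
        # Assumption of this sketch: uniform over the document's authors.
        author = rng.choice(authors)
        # Pick a topic from that author's distribution over topics.
        topic = rng.choices(range(n_topics), weights=theta[author])[0]
        # Pick a word from that topic's distribution over words.
        word = rng.choices(vocab, weights=phi[topic])[0]
        words.append(word)
        assignments.append((author, topic))
    return words, assignments


def document_topic_mixture(authors, theta):
    """Topic distribution of a multi-author document, taken here as the
    equal-weight mixture of its authors' topic distributions."""
    n_topics = len(theta[authors[0]])
    return [sum(theta[a][t] for a in authors) / len(authors)
            for t in range(n_topics)]


# Hypothetical toy parameters, for illustration only.
vocab = ["neuron", "spike", "kernel", "margin"]
phi = [[0.5, 0.5, 0.0, 0.0],   # topic 0 favours "neuron", "spike"
       [0.0, 0.0, 0.5, 0.5]]   # topic 1 favours "kernel", "margin"
theta = {"alice": [0.9, 0.1], "bob": [0.2, 0.8]}

rng = random.Random(0)
words, assignments = generate_document(["alice", "bob"], theta, phi, vocab, 10, rng)
print(words)
print(document_topic_mixture(["alice", "bob"], theta))
```

In the paper the direction is reversed: the words and author lists are observed, and the author-topic and topic-word distributions are the quantities to be inferred.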


Cited By

  Exploratory image data analysis for quality improvement hypothesis generation. Quality Engineering 36:4 (693-712). DOI: 10.1080/08982112.2023.2285305. Online publication date: 22-Jan-2024
  Collaboration of issuing agencies and topic evolution of health informatisation policies in China. Journal of Information Science 49:6 (1692-1710). DOI: 10.1177/01655515221074323. Online publication date: 1-Dec-2023
  Textual Analytics on 'Azadi Ka Amrit Mahotsav': Exploring Indian citizens' ideas for achieving Aatmanirbhar Bharat. 2023 Third International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT) (1-8). DOI: 10.1109/ICAECT57570.2023.10118308. Online publication date: 5-Jan-2023



Julien Velcin

For large text corpora, the task of extracting and following information about topics, authors, and opinions is very challenging. Applications are numerous and span various domains, including social networks. The authors' proposed model is a novel contribution to this research area. It is closely related to other probabilistic models, such as latent Dirichlet allocation (LDA) [1] and McCallum's model [2].

In this paper, Rosen-Zvi et al. propose a new generative model for document collections. Their author-topic (AT) model differs from McCallum's in that each author is associated with a distribution over topics. This approach leads to numerous applications, such as word sense disambiguation and information retrieval (IR), which are described in detail.

Although they present a well-grounded, detailed theoretical basis, the choice of fixing the hyperparameters α and β could have been discussed in more depth. The paper lacks a formal and experimental comparison with a different type of approach, such as a graph-based one [3]. Also, the authors compare their approach with term frequency-inverse document frequency (tf-idf) as if it were an algorithm. In fact, tf-idf is a formula that (sometimes) gives a better representation of textual data, typically in an IR task. Hence, the comparison between AT models and tf-idf needs more in-depth investigation.

In summary, the authors present an interesting and well-grounded model. That being said, potential readers should be fairly familiar with Bayesian statistics.

Online Computing Reviews Service
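The reviewer's remark that tf-idf is a weighting formula rather than an algorithm can be made concrete. One common variant is shown below; several variants exist, and the review does not say which one the paper used, so the exact form here is an assumption.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}
```

Here tf(t, d) is the number of times term t occurs in document d, N is the number of documents in the collection, and df(t) is the number of documents that contain t. The formula only assigns weights to terms; a retrieval method still has to be built on top of them, for example by ranking documents by similarity between weighted vectors.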



Published In

ACM Transactions on Information Systems  Volume 28, Issue 1
January 2010
157 pages
Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 January 2010
Accepted: 01 March 2009
Revised: 01 October 2008
Received: 01 September 2007
Published in TOIS Volume 28, Issue 1


Author Tags

  1. Gibbs sampling
  2. Topic models
  3. author models
  4. perplexity
  5. unsupervised learning


