
Learning author-topic models from text corpora

Published: 29 January 2010

Abstract

We propose an unsupervised learning technique for extracting information about authors and topics from large text collections. We model documents as if they were generated by a two-stage stochastic process. An author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words. The probability distribution over topics in a multi-author paper is a mixture of the distributions associated with the authors. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to three large text corpora: 150,000 abstracts from the CiteSeer digital library, 1740 papers from the Neural Information Processing Systems (NIPS) Conferences, and 121,000 emails from the Enron corporation. We discuss in detail the interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. Experiments based on perplexity scores for test documents and precision-recall for document retrieval are used to illustrate systematic differences between the proposed author-topic model and a number of alternatives. Extensions to the model, allowing, for example, generalizations of the notion of an author, are also briefly discussed.
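The two-stage generative process described in the abstract can be sketched in a few lines of code. This is an illustrative reconstruction, not the authors' implementation: the corpus size, vocabulary size, and hyperparameter values below are invented for the example, and in the paper the distributions theta and phi are learned by Gibbs sampling rather than drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics, n_words, n_authors = 4, 50, 3
alpha, beta = 0.5, 0.01  # symmetric Dirichlet hyperparameters (illustrative values)

# Each author is a distribution over topics; each topic, a distribution over words.
theta = rng.dirichlet(np.full(n_topics, alpha), size=n_authors)  # authors x topics
phi = rng.dirichlet(np.full(n_words, beta), size=n_topics)       # topics x words

def generate_document(author_ids, length):
    """Generate one document: for each token, pick one of the paper's
    co-authors uniformly, then a topic from that author's distribution,
    then a word from the topic's distribution."""
    words = []
    for _ in range(length):
        a = rng.choice(author_ids)               # uniform over the co-authors
        z = rng.choice(n_topics, p=theta[a])     # topic from the author's theta
        w = rng.choice(n_words, p=phi[z])        # word from the topic's phi
        words.append(int(w))
    return words

doc = generate_document([0, 2], length=20)  # a two-author "paper"
```

Note how a multi-author document's topic distribution is automatically the mixture of its authors' distributions, exactly as the abstract states: marginalizing over the uniform author choice mixes the rows of theta.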

References

[1]
Berry, M. W., Dumais, S. T., and O'Brien, G. W. 1994. Using linear algebra for intelligent information retrieval. SIAM Rev. 573--595.
[2]
Blei, D. and Lafferty, J. 2006a. Correlated topic models. In Advances in Neural Information Processing Systems 18, Y. Weiss, B. Schölkopf, and J. Platt Eds., MIT Press, Cambridge, MA, 147--154.
[3]
Blei, D. and Lafferty, J. 2006b. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning. ACM Press, New York, NY, 113--120.
[4]
Blei, D. M., Ng, A. Y., and Jordan, M. I. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993--1022.
[5]
Box, G. E. P. and Tiao, G. C. 1973. Bayesian Inference in Statistical Analysis. Addison-Wesley, Reading, MA.
[6]
Brooks, S. 1998. Markov chain Monte Carlo method and its application. Statistician 47, 69--100.
[7]
Buntine, W. L. and Jakulin, A. 2004. Applying discrete PCA in data analysis. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. M. Chickering and J. Halpern Eds. Morgan Kaufmann Publishers, San Francisco, CA, 59--66.
[8]
Canny, J. 2004. GaP: a factor model for discrete data. In Proceedings of the 27th Annual International Conference on Research and Development in Information Retrieval (SIGIR). ACM Press, New York, NY, 122--129.
[9]
Chemudugunta, C., Smyth, P., and Steyvers, M. 2007. Modeling general and specific aspects of documents with a probabilistic topic model. In Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman Eds., MIT Press, Cambridge, MA, 241--248.
[10]
Cohn, D. and Hofmann, T. 2001. The missing link—a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, T. K. Leen, T. G. Dietterich, and V. Tresp Eds., MIT Press, Cambridge, MA, 430--436.
[11]
Cutting, D. R., Karger, D., Pedersen, J. O., and Tukey, J. W. 1992. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 318--329.
[12]
Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. 1990. Indexing by latent semantic analysis. J. Amer. Soc. Inform. Sci. 41, 6, 391--407.
[13]
Dhillon, I. S. and Modha, D. S. 2001. Concept decompositions for large sparse text data using clustering. Mach. Learn. 42, 1/2, 143--175.
[14]
Diederich, J., Kindermann, J., Leopold, E., and Paass, G. 2003. Authorship attribution with support vector machines. Appl. Intell. 19, 1, 109--123.
[15]
Erosheva, E., Fienberg, S., and Lafferty, J. 2004. Mixed-membership models of scientific publications. Proc. Nat. Acad. Sci. 101, 5220--5227.
[16]
Erten, C., Harding, P. J., Kobourov, S. G., Wampler, K., and Yee, G. 2003. Exploring the computing literature using temporal graph visualization. Tech. rep., Department of Computer Science, University of Arizona.
[17]
Gilks, W., Richardson, S., and Spiegelhalter, D. 1996. Markov Chain Monte Carlo in Practice. Chapman & Hall, New York, NY.
[18]
Gray, A., Sallis, P., and MacDonell, S. 1997. Software forensics: Extending authorship analysis techniques to computer programs. In Proceedings of the 3rd Biannual Conference of the International Association of Forensic Linguists (IAFL). 1--8.
[19]
Griffiths, T. L. and Steyvers, M. 2004. Finding scientific topics. Proc. Nat. Acad. Sci. 101, 5228--5235.
[20]
Griffiths, T. L., Steyvers, M., Blei, D. M., and Tenenbaum, J. B. 2005. Integrating topics and syntax. In Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou Eds., MIT Press, Cambridge, MA.
[21]
Gruber, A., Rosen-Zvi, M., and Weiss, Y. 2007. Hidden topic Markov models. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS).
[22]
Hofmann, T. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 50--57.
[23]
Holmes, D. I. 1998. The evolution of stylometry in humanities scholarship. Literary Ling. Comput. 13, 3, 111--117.
[24]
Iyer, R. and Ostendorf, M. 1999. Modelling long distance dependence in language: Topic mixtures versus dynamic cache models. IEEE Trans. Speech Audio Process. 7, 1, 30--39.
[25]
Kautz, H., Selman, B., and Shah, M. 1997. Referral Web: combining social networks and collaborative filtering. Comm. ACM 40, 3, 63--65.
[26]
Kjell, B. 1994. Authorship determination using letter pair frequency features with neural network classifiers. Literary Ling. Comput. 9, 2, 119--124.
[27]
Lagus, K., Honkela, T., Kaski, S., and Kohonen, T. 1999. WEBSOM for textual data mining. Artif. Intell. Rev. 13, 5-6, 345--364.
[28]
Lawrence, S., Giles, C. L., and Bollacker, K. 1999. Digital libraries and autonomous citation indexing. IEEE Comput. 32, 6, 67--71.
[29]
Lee, D. D. and Seung, H. S. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788.
[30]
Li, W. and McCallum, A. 2006. DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning. ACM Press, New York, NY, 577--584.
[31]
McCain, K. W. 1990. Mapping authors in intellectual space: a technical overview. J. Amer. Soc. Inform. Sci. 41, 6, 433--443.
[32]
McCallum, A. 1999. Multi-label text classification with a mixture model trained by EM. In AAAI Workshop on Text Learning.
[33]
McCallum, A., Corrada-Emmanuel, A., and Wang, X. 2005. Topic and role discovery in social networks. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, 786--791.
[34]
McCallum, A., Nigam, K., and Ungar, L. H. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, 169--178.
[35]
Mei, Q. and Zhai, C. 2006. A mixture model for contextual text mining. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM, New York, NY, 649--655.
[36]
Minka, T. and Lafferty, J. 2002. Expectation-propagation for the generative aspect model. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers, San Francisco, CA, 352--359.
[37]
Mosteller, F. and Wallace, D. 1964. Inference and Disputed Authorship: The Federalist Papers. Addison-Wesley, Reading, MA.
[38]
Mutschke, P. 2003. Mining networks and central entities in digital libraries: a graph theoretic approach applied to co-author networks. In Advances in Intelligent Data Analysis V, Lecture Notes in Computer Science, vol. 2810, Springer-Verlag, 155--166.
[39]
Newman, M. 2001. Scientific collaboration networks. I. Network construction and fundamental results. Phys. Rev. E 64, 1, 016131.
[40]
Ponte, J. M. and Croft, W. B. 1998. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 275--281.
[41]
Popescul, A., Ungar, L. H., Flake, G. W., Lawrence, S., and Giles, C. L. 2000. Clustering and identifying temporal trends in document databases. In Proceedings of the IEEE Advances in Digital Libraries 2000. IEEE Computer Society, Los Alamitos, CA, 173--182.
[42]
Pritchard, J., Stephens, M., and Donnelly, P. 2000. Inference of population structure using multilocus genotype data. Genetics 155, 945--959.
[43]
Roberts, G. O. and Sahu, S. K. 1997. Updating schemes, correlation structure, blocking and parameterisation for the Gibbs sampler. J. Royal Statist. Soc. B, 59, 291--317.
[44]
Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., and Gatford, M. 1995. Okapi at TREC-3. In Proceedings of TREC. 109--126.
[45]
Rosen-Zvi, M., Griffiths, T., Steyvers, M., and Smyth, P. 2004. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, M. Chickering and J. Halpern, Eds. Morgan Kaufmann, San Francisco, CA, 487--494.
[46]
Sparck Jones, K., Walker, S., and Robertson, S. E. 2000. A probabilistic model of information retrieval: development and comparative experiments. Inform. Proc. Manag. 36, 6, 779--808.
[47]
Steyvers, M., Smyth, P., Rosen-Zvi, M., and Griffiths, T. 2004. Probabilistic author-topic models for information discovery. In Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, 306--315.
[48]
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. 2005. Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou Eds., MIT Press, Cambridge, MA.
[49]
Thisted, R. and Efron, B. 1987. Did Shakespeare write a newly-discovered poem? Biometrika 74, 445--455.
[50]
Ueda, N. and Saito, K. 2003. Parametric mixture models for multi-labeled text. In Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun, and K. Obermayer Eds., MIT Press, Cambridge, MA, 721--728.
[51]
Wei, X. and Croft, W. B. 2006. LDA-based document models for ad hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 178--185.
[52]
White, S. and Smyth, P. 2003. Algorithms for estimating relative importance in networks. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, 266--275.
[53]
Yang, Y. 1999. An evaluation of statistical approaches to text categorization. Inform. Retriev. 1, 1-2, 69--90.



Reviews

Julien Velcin

For large text corpora, the task of extracting and following information about topics, authors, and opinions is very challenging. Applications are numerous and relate to various domains, including social networks. The authors' proposed model is a novel contribution to this research area. It is closely related to other probabilistic models, such as latent Dirichlet allocation (LDA) [1] and McCallum's model [2]. In this paper, Rosen-Zvi et al. propose a new generative model for document collections. Their author-topic (AT) model differs from McCallum's in that each author is associated with a distribution over topics. This approach leads to numerous applications, such as word sense disambiguation and information retrieval (IR), which are described in detail. Although they present a well-grounded, detailed theoretical basis, the choice of fixing the hyperparameters α and β could have been discussed in more depth. The paper lacks a formal and experimental comparison with a different type of approach, such as a graph-based one [3]. Also, the authors compare their approach with term frequency-inverse document frequency (tf-idf) as if it were an algorithm. In fact, tf-idf is a formula that (sometimes) gives a better representation of textual data, typically in an IR task. Hence, the comparison between AT models and tf-idf needs more in-depth investigation. In summary, the authors present an interesting and well-grounded model. That being said, potential readers should be fairly familiar with Bayesian statistics. Online Computing Reviews Service
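For readers unfamiliar with the tf-idf weighting the reviewer contrasts with the AT model, here is a minimal sketch of one standard variant (tf = raw count, idf = log(N / df)). The toy corpus is invented for illustration; the paper's retrieval experiments may use a different tf-idf variant.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute tf-idf weights for a list of tokenized documents.
    tf is the raw term count in a document; idf is log(N / df),
    where df counts the documents containing the term."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each doc contributes at most once per term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return weights

docs = [["topic", "model", "author"],
        ["author", "email"],
        ["topic", "topic", "graph"]]
w = tfidf(docs)
# "author" appears in 2 of 3 documents, so its idf is log(3/2)
```

The reviewer's point is visible here: tf-idf is just a reweighting of the word-count representation, not a generative model, so comparing it with the AT model conflates a document representation with a probabilistic inference method.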



Published In

ACM Transactions on Information Systems, Volume 28, Issue 1
January 2010
157 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/1658377
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 January 2010
Accepted: 01 March 2009
Revised: 01 October 2008
Received: 01 September 2007
Published in TOIS Volume 28, Issue 1


Author Tags

  1. Gibbs sampling
  2. topic models
  3. author models
  4. perplexity
  5. unsupervised learning

Qualifiers

  • Research-article
  • Research
  • Refereed


Cited By

  • (2024) Exploratory image data analysis for quality improvement hypothesis generation. Quality Engineering 36:4, 693-712. DOI: 10.1080/08982112.2023.2285305. Online publication date: 22-Jan-2024.
  • (2023) Collaboration of issuing agencies and topic evolution of health informatisation policies in China. Journal of Information Science 49:6, 1692-1710. DOI: 10.1177/01655515221074323. Online publication date: 1-Dec-2023.
  • (2023) Textual Analytics on ‘Azadi Ka Amrit Mahotsav’: Exploring Indian citizens' ideas for achieving Aatmanirbhar Bharat. 2023 Third International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT), 1-8. DOI: 10.1109/ICAECT57570.2023.10118308. Online publication date: 5-Jan-2023.
  • (2022) UGC Knowledge Features and Their Influences on the Stock Market: An Empirical Study Based on Topic Modeling. Information 13:10, 454. DOI: 10.3390/info13100454. Online publication date: 27-Sep-2022.
  • (2022) Selecting Workers Wisely for Crowdsourcing When Copiers and Domain Experts Co-exist. Future Internet 14:2, 37. DOI: 10.3390/fi14020037. Online publication date: 24-Jan-2022.
  • (2022) An improved author-topic (AT) model with authorship credit allocation schemes. Journal of Information Science 51:1, 184-204. DOI: 10.1177/01655515221133530. Online publication date: 23-Nov-2022.
  • (2022) Posterior Summaries of Grocery Retail Topic Models: Evaluation, Interpretability and Credibility. Journal of the Royal Statistical Society Series C: Applied Statistics 71:3, 562-588. DOI: 10.1111/rssc.12546. Online publication date: 9-Apr-2022.
  • (2022) Learning to Build Accurate Service Representations and Visualization. IEEE Transactions on Services Computing 15:3, 1551-1563. DOI: 10.1109/TSC.2020.3001307. Online publication date: 1-May-2022.
  • (2022) Robustness, replicability and scalability in topic modelling. Journal of Informetrics 16:1, 101224. DOI: 10.1016/j.joi.2021.101224. Online publication date: Feb-2022.
  • (2022) Hierarchical Bayesian text modeling for the unsupervised joint analysis of latent topics and semantic clusters. International Journal of Approximate Reasoning 147, 23-39. DOI: 10.1016/j.ijar.2022.05.002. Online publication date: 1-Aug-2022.
