
Learning author-topic models from text corpora

Published: 29 January 2010

Abstract

We propose an unsupervised learning technique for extracting information about authors and topics from large text collections. We model documents as if they were generated by a two-stage stochastic process. An author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words. The probability distribution over topics in a multi-author paper is a mixture of the distributions associated with the authors. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to three large text corpora: 150,000 abstracts from the CiteSeer digital library, 1740 papers from the Neural Information Processing Systems (NIPS) Conferences, and 121,000 emails from the Enron corporation. We discuss in detail the interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. Experiments based on perplexity scores for test documents and precision-recall for document retrieval are used to illustrate systematic differences between the proposed author-topic model and a number of alternatives. Extensions to the model, allowing, for example, generalizations of the notion of an author, are also briefly discussed.
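The two-stage generative process described in the abstract can be sketched in a few lines of code. This is an illustrative reconstruction, not the authors' implementation: the corpus size, vocabulary size, and hyperparameter values below are invented for the example, and in the paper the distributions theta and phi are learned by Gibbs sampling rather than drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics, n_words, n_authors = 4, 50, 3
alpha, beta = 0.5, 0.01  # symmetric Dirichlet hyperparameters (illustrative values)

# Each author is a distribution over topics; each topic, a distribution over words.
theta = rng.dirichlet(np.full(n_topics, alpha), size=n_authors)  # authors x topics
phi = rng.dirichlet(np.full(n_words, beta), size=n_topics)       # topics x words

def generate_document(author_ids, length):
    """Generate one document: for each token, pick one of the paper's
    co-authors uniformly, then a topic from that author's distribution,
    then a word from the topic's distribution."""
    words = []
    for _ in range(length):
        a = rng.choice(author_ids)               # uniform over the co-authors
        z = rng.choice(n_topics, p=theta[a])     # topic from the author's theta
        w = rng.choice(n_words, p=phi[z])        # word from the topic's phi
        words.append(int(w))
    return words

doc = generate_document([0, 2], length=20)  # a two-author "paper"
```

Note how a multi-author document's topic distribution is automatically the mixture of its authors' distributions, exactly as the abstract states: marginalizing over the uniform author choice mixes the rows of theta.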

References

[1]
Berry, M. W., Dumais, S. T., and O'Brien, G. W. 1994. Using linear algebra for intelligent information retrieval. SIAM Rev. 573--595.
[2]
Blei, D. and Lafferty, J. 2006a. Correlated topic models. In Advances in Neural Information Processing Systems 18, Y. Weiss, B. Schölkopf, and J. Platt Eds., MIT Press, Cambridge, MA, 147--154.
[3]
Blei, D. and Lafferty, J. 2006b. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning. ACM Press, New York, NY, 113--120.
[4]
Blei, D. M., Ng, A. Y., and Jordan, M. I. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993--1022.
[5]
Box, G. E. P. and Tiao, G. C. 1973. Bayesian Inference in Statistical Analysis. Addison-Wesley, Reading, MA.
[6]
Brooks, S. 1998. Markov chain Monte Carlo method and its application. Statistician 47, 69--100.
[7]
Buntine, W. L. and Jakulin, A. 2004. Applying discrete PCA in data analysis. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. M. Chickering and J. Halpern Eds. Morgan Kaufmann Publishers, San Francisco, CA, 59--66.
[8]
Canny, J. 2004. GaP: a factor model for discrete data. In Proceedings of the 27th Annual International Conference on Research and Development in Information Retrieval (SIGIR). ACM Press, New York, NY, 122--129.
[9]
Chemudugunta, C., Smyth, P., and Steyvers, M. 2007. Modeling general and specific aspects of documents with a probabilistic topic model. In Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman Eds., MIT Press, Cambridge, MA, 241--248.
[10]
Cohn, D. and Hofmann, T. 2001. The missing link—a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, T. K. Leen, T. G. Dietterich, and V. Tresp Eds., MIT Press, Cambridge, MA, 430--436.
[11]
Cutting, D. R., Karger, D., Pedersen, J. O., and Tukey, J. W. 1992. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 318--329.
[12]
Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. 1990. Indexing by latent semantic analysis. J. Amer. Soc. Inform. Sci. 41, 6, 391--407.
[13]
Dhillon, I. S. and Modha, D. S. 2001. Concept decompositions for large sparse text data using clustering. Mach. Learn. 42, 1/2, 143--175.
[14]
Diederich, J., Kindermann, J., Leopold, E., and Paass, G. 2003. Authorship attribution with support vector machines. Appl. Intell. 19, 1, 109--123.
[15]
Erosheva, E., Fienberg, S., and Lafferty, J. 2004. Mixed-membership models of scientific publications. Proc. Nat. Acad. Sci. 101, 5220--5227.
[16]
Erten, C., Harding, P. J., Kobourov, S. G., Wampler, K., and Yee, G. 2003. Exploring the computing literature using temporal graph visualization. Tech. rep., Department of Computer Science, University of Arizona.
[17]
Gilks, W., Richardson, S., and Spiegelhalter, D. 1996. Markov Chain Monte Carlo in Practice. Chapman & Hall, New York, NY.
[18]
Gray, A., Sallis, P., and MacDonell, S. 1997. Software forensics: Extending authorship analysis techniques to computer programs. In Proceedings of the 3rd Biannual Conference of the International Association of Forensic Linguists (IAFL). 1--8.
[19]
Griffiths, T. L. and Steyvers, M. 2004. Finding scientific topics. Proc. Nat. Acad. Sci. 101, 5228--5235.
[20]
Griffiths, T. L., Steyvers, M., Blei, D. M., and Tenenbaum, J. B. 2005. Integrating topics and syntax. In Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou Eds., MIT Press, Cambridge, MA.
[21]
Gruber, A., Rosen-Zvi, M., and Weiss, Y. 2007. Hidden topic Markov models. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS).
[22]
Hofmann, T. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 50--57.
[23]
Holmes, D. I. 1998. The evolution of stylometry in humanities scholarship. Literary Ling. Comput. 13, 3, 111--117.
[24]
Iyer, R. and Ostendorf, M. 1999. Modelling long distance dependence in language: Topic mixtures versus dynamic cache models. IEEE Trans. Speech Audio Process. 7, 1, 30--39.
[25]
Kautz, H., Selman, B., and Shah, M. 1997. Referral Web: combining social networks and collaborative filtering. Comm. ACM 40, 3, 63--65.
[26]
Kjell, B. 1994. Authorship determination using letter pair frequency features with neural network classifiers. Literary Ling. Comput. 9, 2, 119--124.
[27]
Lagus, K., Honkela, T., Kaski, S., and Kohonen, T. 1999. WEBSOM for textual data mining. Artif. Intell. Rev. 13, 5-6, 345--364.
[28]
Lawrence, S., Giles, C. L., and Bollacker, K. 1999. Digital libraries and autonomous citation indexing. IEEE Comput. 32, 6, 67--71.
[29]
Lee, D. D. and Seung, H. S. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788.
[30]
Li, W. and McCallum, A. 2006. DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning. ACM Press, New York, NY, 577--584.
[31]
McCain, K. W. 1990. Mapping authors in intellectual space: a technical overview. J. Amer. Soc. Inform. Sci. 41, 6, 433--443.
[32]
McCallum, A. 1999. Multi-label text classification with a mixture model trained by EM. In AAAI Workshop on Text Learning.
[33]
McCallum, A., Corrada-Emmanuel, A., and Wang, X. 2005. Topic and role discovery in social networks. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, 786--791.
[34]
McCallum, A., Nigam, K., and Ungar, L. H. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, 169--178.
[35]
Mei, Q. and Zhai, C. 2006. A mixture model for contextual text mining. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM, New York, NY, 649--655.
[36]
Minka, T. and Lafferty, J. 2002. Expectation-propagation for the generative aspect model. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers, San Francisco, CA, 352--359.
[37]
Mosteller, F. and Wallace, D. 1964. Inference and Disputed Authorship: The Federalist Papers. Addison-Wesley, Reading, MA.
[38]
Mutschke, P. 2003. Mining networks and central entities in digital libraries: a graph theoretic approach applied to co-author networks. In Advances in Intelligent Data Analysis V, Lecture Notes in Computer Science, vol. 2810, Springer-Verlag, 155--166.
[39]
Newman, M. 2001. Scientific collaboration networks. I. Network construction and fundamental results. Phys. Rev. E 64, 1, 016131.
[40]
Ponte, J. M. and Croft, W. B. 1998. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 275--281.
[41]
Popescul, A., Ungar, L. H., Flake, G. W., Lawrence, S., and Giles, C. L. 2000. Clustering and identifying temporal trends in document databases. In Proceedings of the IEEE Advances in Digital Libraries 2000. IEEE Computer Society, Los Alamitos, CA, 173--182.
[42]
Pritchard, J., Stephens, M., and Donnelly, P. 2000. Inference of population structure using multilocus genotype data. Genetics 155, 945--959.
[43]
Roberts, G. O. and Sahu, S. K. 1997. Updating schemes, correlation structure, blocking and parameterisation for the Gibbs sampler. J. Royal Statist. Soc. B, 59, 291--317.
[44]
Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., and Gatford, M. 1995. Okapi at TREC-3. In Proceedings of TREC. 109--126.
[45]
Rosen-Zvi, M., Griffiths, T., Steyvers, M., and Smyth, P. 2004. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, M. Chickering and J. Halpern, Eds. Morgan Kaufmann, San Francisco, CA, 487--494.
[46]
Sparck Jones, K., Walker, S., and Robertson, S. E. 2000. A probabilistic model of information retrieval: development and comparative experiments. Inform. Proc. Manag. 36, 6, 779--808.
[47]
Steyvers, M., Smyth, P., Rosen-Zvi, M., and Griffiths, T. 2004. Probabilistic author-topic models for information discovery. In Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, 306--315.
[48]
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. 2005. Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou Eds., MIT Press, Cambridge, MA.
[49]
Thisted, R. and Efron, B. 1987. Did Shakespeare write a newly-discovered poem? Biometrika 74, 445--455.
[50]
Ueda, N. and Saito, K. 2003. Parametric mixture models for multi-labeled text. In Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun, and K. Obermayer Eds., MIT Press, Cambridge, MA, 721--728.
[51]
Wei, X. and Croft, W. B. 2006. LDA-based document models for ad hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 178--185.
[52]
White, S. and Smyth, P. 2003. Algorithms for estimating relative importance in networks. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, 266--275.
[53]
Yang, Y. 1999. An evaluation of statistical approaches to text categorization. Inform. Retriev. 1, 1-2, 69--90.



Reviews

Julien Velcin

For large text corpora, the task of extracting and following information about topics, authors, and opinions is very challenging. Applications are numerous and relate to various domains, including social networks. The authors' proposed model is a novel contribution to this research area. It is closely related to other probabilistic models, such as latent Dirichlet allocation (LDA) [1] and McCallum's model [2]. In this paper, Rosen-Zvi et al. propose a new generative model for document collections. Their author-topic (AT) model differs from McCallum's in that each author is associated with a distribution over topics. This approach leads to numerous applications, such as word sense disambiguation and information retrieval (IR), which are described in detail. Although they present a well-grounded, detailed theoretical basis, the choice of fixing the hyperparameters α and β could have been discussed in more depth. The paper lacks a formal and experimental comparison with a different type of approach, such as a graph-based one [3]. Also, the authors compare their approach with term frequency-inverse document frequency (tf-idf) as if it were an algorithm. In fact, tf-idf is a formula that (sometimes) gives a better representation of textual data, typically in an IR task. Hence, the comparison between AT models and tf-idf needs more in-depth investigation. In summary, the authors present an interesting and well-grounded model. That being said, potential readers should be fairly familiar with Bayesian statistics. Online Computing Reviews Service
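For readers unfamiliar with the tf-idf weighting the reviewer contrasts with the AT model, here is a minimal sketch of one standard variant (tf = raw count, idf = log(N / df)). The toy corpus is invented for illustration; the paper's retrieval experiments may use a different tf-idf variant.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute tf-idf weights for a list of tokenized documents.
    tf is the raw term count in a document; idf is log(N / df),
    where df counts the documents containing the term."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each doc contributes at most once per term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return weights

docs = [["topic", "model", "author"],
        ["author", "email"],
        ["topic", "topic", "graph"]]
w = tfidf(docs)
# "author" appears in 2 of 3 documents, so its idf is log(3/2)
```

The reviewer's point is visible here: tf-idf is just a reweighting of the word-count representation, not a generative model, so comparing it with the AT model conflates a document representation with a probabilistic inference method.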



Published In

ACM Transactions on Information Systems, Volume 28, Issue 1
January 2010
157 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/1658377
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 January 2010
Accepted: 01 March 2009
Revised: 01 October 2008
Received: 01 September 2007
Published in TOIS Volume 28, Issue 1


Author Tags

  1. Gibbs sampling
  2. topic models
  3. author models
  4. perplexity
  5. unsupervised learning

Qualifiers

  • Research-article
  • Research
  • Refereed


Cited By

  • (2024) Exploratory image data analysis for quality improvement hypothesis generation. Quality Engineering 36:4, 693-712. DOI: 10.1080/08982112.2023.2285305. Online publication date: 22-Jan-2024.
  • (2023) Collaboration of issuing agencies and topic evolution of health informatisation policies in China. Journal of Information Science 49:6, 1692-1710. DOI: 10.1177/01655515221074323. Online publication date: 1-Dec-2023.
  • (2023) Textual Analytics on ‘Azadi Ka Amrit Mahotsav’: Exploring Indian citizens' ideas for achieving Aatmanirbhar Bharat. 2023 Third International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT), 1-8. DOI: 10.1109/ICAECT57570.2023.10118308. Online publication date: 5-Jan-2023.
  • (2022) UGC Knowledge Features and Their Influences on the Stock Market: An Empirical Study Based on Topic Modeling. Information 13:10, 454. DOI: 10.3390/info13100454. Online publication date: 27-Sep-2022.
  • (2022) Selecting Workers Wisely for Crowdsourcing When Copiers and Domain Experts Co-exist. Future Internet 14:2, 37. DOI: 10.3390/fi14020037. Online publication date: 24-Jan-2022.
  • (2022) An improved author-topic (AT) model with authorship credit allocation schemes. Journal of Information Science 51:1, 184-204. DOI: 10.1177/01655515221133530. Online publication date: 23-Nov-2022.
  • (2022) Posterior Summaries of Grocery Retail Topic Models: Evaluation, Interpretability and Credibility. Journal of the Royal Statistical Society Series C: Applied Statistics 71:3, 562-588. DOI: 10.1111/rssc.12546. Online publication date: 9-Apr-2022.
  • (2022) Learning to Build Accurate Service Representations and Visualization. IEEE Transactions on Services Computing 15:3, 1551-1563. DOI: 10.1109/TSC.2020.3001307. Online publication date: 1-May-2022.
  • (2022) Robustness, replicability and scalability in topic modelling. Journal of Informetrics 16:1, 101224. DOI: 10.1016/j.joi.2021.101224. Online publication date: Feb-2022.
  • (2022) Hierarchical Bayesian text modeling for the unsupervised joint analysis of latent topics and semantic clusters. International Journal of Approximate Reasoning 147, 23-39. DOI: 10.1016/j.ijar.2022.05.002. Online publication date: 1-Aug-2022.
