Abstract
Statistical language models can learn relationships between the topics discussed in a document collection and the persons, organizations and places mentioned in each document. We present a novel combination of statistical topic models and named-entity recognizers to jointly analyze entities mentioned (persons, organizations and places) and topics discussed in a collection of 330,000 New York Times news articles. We demonstrate an analytic framework that automatically extracts from a large collection: topics, topic trends, and topics that relate entities.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Newman, D., Chemudugunta, C., Smyth, P., Steyvers, M. (2006). Analyzing Entities and Topics in News Articles Using Statistical Topic Models. In: Mehrotra, S., Zeng, D.D., Chen, H., Thuraisingham, B., Wang, FY. (eds) Intelligence and Security Informatics. ISI 2006. Lecture Notes in Computer Science, vol 3975. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11760146_9
DOI: https://doi.org/10.1007/11760146_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34478-0
Online ISBN: 978-3-540-34479-7
eBook Packages: Computer Science (R0)