poster

A time-dependent topic model for multiple text streams

Authors:
Liangjie Hong, Lehigh University, Bethlehem, PA, USA
Byron Dom, Yahoo! Labs, Sunnyvale, CA, USA
Siva Gurumurthy, Yahoo! Labs, Sunnyvale, CA, USA
Kostas Tsioutsiouliklis, Yahoo! Labs, Sunnyvale, CA, USA

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2011, Pages 832–840. https://doi.org/10.1145/2020408.2020551

Published: 21 August 2011

ABSTRACT

In recent years, social media have become indispensable tools for information dissemination, operating in tandem with traditional media outlets such as newspapers, and it has become critical to understand the interaction between the new and old sources of news. Although both social media and traditional media have attracted attention from several research communities, most prior work has been limited to a single medium. In addition, temporal analysis of these sources can provide an understanding of how information spreads and evolves. Modeling temporal dynamics while considering multiple sources is a challenging research problem. In this paper, we address the problem of modeling text streams from two news sources: Twitter and Yahoo! News. Our analysis addresses both their individual properties (including temporal dynamics) and their inter-relationships. This work extends standard topic models by allowing each text stream to have both local topics and shared topics. For temporal modeling, we associate each topic with a time-dependent function that characterizes its popularity over time. By integrating the two models, we effectively model the temporal dynamics of multiple correlated text streams in a unified framework. We evaluate our model on a large-scale dataset consisting of text streams from both Twitter and news feeds from Yahoo! News. Besides overcoming the limitations of existing models, we show that our work achieves better perplexity on unseen data and identifies more coherent topics. We also analyze how real-world events can be found from the topics obtained by our model.
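
The two ideas in the abstract (shared versus local topics, and a time-dependent popularity function per topic) can be illustrated with a toy sketch. This is not the paper's model: the abstract does not give the form of the time-dependent function, so the Gaussian bump below, the topic names, the word lists, and every function name are assumptions made purely for illustration. Priors and per-document topic mixtures are omitted.

```python
import math
import random


def popularity(t, peak, width):
    """Hypothetical popularity curve: a Gaussian bump centred at `peak`."""
    return math.exp(-0.5 * ((t - peak) / width) ** 2)


def topic_weights(stream, t, topics):
    """Normalised topic probabilities for one stream at time t.

    A topic is available to a stream if it is shared by all streams
    or local to that particular stream.
    """
    raw = {
        name: popularity(t, spec["peak"], spec["width"])
        for name, spec in topics.items()
        if spec["scope"] in ("shared", stream)
    }
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


def sample_word(stream, t, topics, rng):
    """Draw a topic by its time-dependent weight, then a word from that topic."""
    weights = topic_weights(stream, t, topics)
    names = list(weights)
    topic = rng.choices(names, weights=[weights[n] for n in names])[0]
    return topic, rng.choice(topics[topic]["words"])


# Illustrative topics only; none of these come from the paper.
TOPICS = {
    "election": {"scope": "shared", "peak": 10.0, "width": 3.0,
                 "words": ["vote", "poll", "debate"]},
    "chatter": {"scope": "twitter", "peak": 5.0, "width": 8.0,
                "words": ["lol", "rt", "omg"]},
    "markets": {"scope": "news", "peak": 20.0, "width": 5.0,
                "words": ["stocks", "bond", "earnings"]},
}

if __name__ == "__main__":
    rng = random.Random(0)
    for stream in ("twitter", "news"):
        print(stream, topic_weights(stream, 10.0, TOPICS))
        print(stream, sample_word(stream, 10.0, TOPICS, rng))
```

In this sketch the shared topic is visible to both streams, while each local topic is visible only to its own stream, and the relative weight of every topic changes with the time stamp `t`.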

Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published in
KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2011
1446 pages
ISBN: 9781450308137
DOI: 10.1145/2020408
General Chair: Chid Apte (IBM Research)
Program Chairs: Joydeep Ghosh (UT Austin), Padhraic Smyth (UC Irvine)
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
news
temporal dynamics
text streams
topic models
twitter
Qualifiers
- poster
Acceptance Rates
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%