ABSTRACT
In recent years, social media have become indispensable tools for information dissemination, operating in tandem with traditional media outlets such as newspapers. It has therefore become critical to understand the interaction between the new and old sources of news. Although both social media and traditional media have attracted attention from several research communities, most prior work has been limited to a single medium. In addition, temporal analysis of these sources can provide an understanding of how information spreads and evolves. Modeling temporal dynamics while considering multiple sources is a challenging research problem. In this paper we address the problem of modeling text streams from two news sources: Twitter and Yahoo! News. Our analysis addresses both their individual properties (including temporal dynamics) and their inter-relationships. This work extends standard topic models by allowing each text stream to have both local topics and shared topics. For temporal modeling, we associate each topic with a time-dependent function that characterizes its popularity over time. By integrating the two models, we capture the temporal dynamics of multiple correlated text streams in a unified framework. We evaluate our model on a large-scale dataset consisting of text streams from Twitter and news feeds from Yahoo! News. Besides overcoming the limitations of existing models, we show that our work achieves better perplexity on unseen data and identifies more coherent topics. We also analyze how real-world events can be identified from the topics obtained by our model.
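The abstract does not state the model's equations, so the following is only an illustrative sketch of the two ideas it names: topics that are either shared across streams or local to one stream, and a time-dependent function that sets each topic's popularity. The Gaussian bump used for popularity, the topic names, the word lists, and all function names are assumptions made for this example and are not taken from the paper. The perplexity function uses the standard definition, the exponential of the negative mean per-word log-likelihood.

```python
import math
import random

# Hypothetical toy topics. "owner" marks a topic as shared by all streams
# or local to a single stream; centers, widths, and words are made up.
TOPICS = {
    "election": {"owner": "shared", "center": 10.0, "width": 3.0,
                 "words": ["vote", "poll", "candidate"]},
    "chatter": {"owner": "twitter", "center": 5.0, "width": 8.0,
                "words": ["lol", "rt", "omg"]},
    "markets": {"owner": "news", "center": 15.0, "width": 5.0,
                "words": ["stocks", "bank", "rates"]},
}


def popularity(t, center, width):
    """Assumed time-dependent popularity: a Gaussian bump peaking at `center`."""
    return math.exp(-((t - center) ** 2) / (2.0 * width ** 2))


def topic_weights(stream, t, topics=TOPICS):
    """Normalized topic proportions for one stream at time t.

    A stream may use every shared topic plus its own local topics.
    """
    raw = {name: popularity(t, spec["center"], spec["width"])
           for name, spec in topics.items()
           if spec["owner"] in ("shared", stream)}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


def sample_document(stream, t, length, rng, topics=TOPICS):
    """Draw a topic per word from the time-dependent weights, then a word."""
    weights = topic_weights(stream, t, topics)
    names = list(weights)
    probs = [weights[name] for name in names]
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=probs)[0]
        words.append(rng.choice(topics[topic]["words"]))
    return words


def word_probability(word, stream, t, topics=TOPICS):
    """p(word | stream, t), with words uniform within each topic."""
    weights = topic_weights(stream, t, topics)
    return sum(weight / len(topics[name]["words"])
               for name, weight in weights.items()
               if word in topics[name]["words"])


def perplexity(words, stream, t, topics=TOPICS):
    """exp of the negative mean log-likelihood per word (lower is better)."""
    log_likelihood = sum(math.log(word_probability(w, stream, t, topics))
                         for w in words)
    return math.exp(-log_likelihood / len(words))


if __name__ == "__main__":
    rng = random.Random(0)
    doc = sample_document("twitter", t=9.0, length=8, rng=rng)
    print(doc)
    print(topic_weights("twitter", 9.0))
    print(perplexity(doc, "twitter", 9.0))
```

In this toy setup the shared topic lets the two streams be compared directly, while the local topics absorb vocabulary that appears in only one stream. The perplexity function assumes every word has non-zero probability under the stream's usable topics; a real evaluation on unseen data would need smoothing.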
Index Terms
- A time-dependent topic model for multiple text streams