Abstract
Surveillance of epidemic outbreaks and spread from social media is an important tool for governments and public health authorities. Machine learning techniques for nowcasting the Flu have made significant inroads into correlating social media trends with case counts and the prevalence of epidemics in a population. However, there is a disconnect between data-driven methods for forecasting Flu incidence and epidemiological models that adopt a state-based understanding of transitions, and this disconnect can lead to sub-optimal predictions. Furthermore, models of epidemiological activity and models of social activity, such as activity on Twitter, predict different shapes and differ in other important ways. In this paper, we propose two temporal topic models (one unsupervised model and one improved weakly supervised model) that capture the hidden states of a user from their tweets and aggregate the states in a geographical region for better estimation of trends. We show that our approaches help fill the gap between phenomenological methods for disease surveillance and epidemiological models. We validate our approaches by modeling the Flu using Twitter in multiple countries of South America. We demonstrate that our models consistently outperform plain vocabulary assessment in Flu case-count predictions and, at the same time, yield better Flu-peak predictions than competitors. We also show that our fine-grained modeling can reconcile some contrasting behaviors between epidemiological and social models.
Notes
Code and vocabulary can be found here: http://people.cs.vt.edu/liangzhe/code/hfstm-a.html.
Acknowledgments
This material is based upon work supported by the National Science Foundation under Grant No. IIS-1353346, by the Maryland Procurement Office under Contract H98230-14-C-0127, by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) Contract Number D12PC000337, and by the VT College of Engineering. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the respective funding agencies.
Responsible editor: Charu Aggarwal.
Appendix
1.1 HFSTM-A-FIT
In this appendix, we present the equations we designed for HFSTM-A-FIT. Because the outline of the HFSTM-FIT algorithm is similar to that of HFSTM-A-FIT, the equations for HFSTM-FIT can be derived from the content shown below.
Let K, T, N, and U be the number of states, the number of tweets per user, the number of words per tweet, and the total number of users, respectively. Let \(O=<O_1,O_2,\ldots ,O_T>\) and \(S=<S_1,S_2,\ldots ,S_T>\) be the observed sequence of tweets and the sequence of hidden states, respectively, for a particular user.
Here is a list of symbols that we will use.
1. \(\epsilon \): the prior for the binary state-switching variable, which determines whether the state of a tweet is drawn from the transition probability matrix or simply copied from the state of the previous tweet (a number in (0, 1])
2. \(\pi \): initial state probability (size \(1\times K\))
3. \(\eta \): transition probability matrix (size \(K\times K\))
4. \(\phi \): word distribution for each state (size \(K\times W\), where W is the total number of keywords for all of the states)
5. \(w_{tn}\): the nth word in the tth tweet
6. \(\lambda \): the background switch variable
7. c: the topic switch variable
8. y: the observed aspect value
For HFSTM-A, as mentioned in Sect. 3.3, the value of \(\lambda \) is biased by the observed aspect value y. We use \(\lambda \) instead of \(\lambda _{y}\) in the following for brevity, but remember the \(\lambda \) value in the equations is actually calculated using:
We want to learn all the parameters given the tweet sequence. For compact notation we use \(H=(\epsilon ,\pi ,\eta ,\phi ,\lambda ,c)\). In HFSTM-A-FIT, we use the forward-backward procedure, for which we define the forward variable \(A_t(i)\) and the backward variable \(B_t(i)\) as follows.
Let \(\gamma _t(i)\) be the probability of being in state \(S_i\) at the tth tweet, given the observed tweet sequence O and the other model parameters. For each user, the size of \(\gamma \) is \(2K\times T\) (the first K states are copies of the previous state, and the second K states are reached after a transition). This probability can be expressed in terms of the forward and backward probabilities.
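As a rough illustration of how \(\gamma \) can be obtained from forward and backward quantities over the 2K expanded states, the following Python sketch runs a generic scaled forward-backward pass. It is not the set of equations used in HFSTM-A-FIT. It assumes that \(\epsilon \) is the probability of drawing the next state from \(\eta \) (and \(1-\epsilon \) the probability of copying the previous state), and that the per-tweet likelihoods are already available in a hypothetical array `emit`. Following the initialization given later in this appendix, all initial mass is placed on the first K indices. The names `expanded_transition` and `posterior_states` are illustrative only.

```python
import numpy as np

def expanded_transition(eta, eps):
    """Build a 2K x 2K transition matrix over (copy, transition) states.

    Indices 0..K-1 are 'copy' states, K..2K-1 are 'transition' states.
    Assumption: eps = P(next state is drawn from eta).
    """
    K = eta.shape[0]
    P = np.zeros((2 * K, 2 * K))
    for i in range(2 * K):
        k = i % K                    # underlying state of expanded index i
        P[i, k] = 1.0 - eps          # copy the previous state
        P[i, K:] = eps * eta[k]      # draw a new state from eta
    return P

def posterior_states(pi, eta, eps, emit):
    """Return gamma with shape (2K, T).

    emit[t, k] is a hypothetical per-tweet likelihood p(O_t | S_t = k).
    """
    T, K = emit.shape
    P = expanded_transition(eta, eps)
    b = np.hstack([emit, emit])      # same emission for both blocks
    A = np.zeros((T, 2 * K))         # scaled forward variable
    B = np.ones((T, 2 * K))          # scaled backward variable
    scale = np.zeros(T)

    A[0, :K] = pi * emit[0]          # second block starts at zero
    scale[0] = A[0].sum()
    A[0] /= scale[0]
    for t in range(1, T):
        A[t] = (A[t - 1] @ P) * b[t]
        scale[t] = A[t].sum()
        A[t] /= scale[t]

    for t in range(T - 2, -1, -1):
        B[t] = P @ (b[t + 1] * B[t + 1]) / scale[t + 1]

    gamma = A * B
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma.T
```

Rescaling the forward variable at every step keeps the recursion numerically stable for long tweet sequences; the same scale factors are reused in the backward pass so that the product of the two is already proportional to the posterior.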
We have two switch variables in the model: l and x. If \(l=1\), the word is generated by either states or topics; if \(l=0\), it is generated by the background. If \(x=0\), the word is generated by topics; if \(x=1\), it is generated by states.
For \(l_i=1\), \(w_i\) is generated by either states or topics.
For \(l_i=0\), \(w_i\) is generated by background.
For \(x_i=0\), \(w_i\) is generated by topics.
For \(x_i=1\), \(w_i\) is generated by states.
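The four cases above amount to a two-level mixture for each word. The following Python sketch marginalizes over both switches to obtain a word likelihood for a given state. It assumes that \(\lambda \) is the probability of \(l=1\) and that c is the probability of \(x=1\); the appendix does not state these conventions explicitly. The arrays `topic_word` and `background` are hypothetical word distributions introduced only for this illustration, and a single topic distribution is used for simplicity.

```python
import numpy as np

def word_likelihood(w, state, lam, c, phi, topic_word, background):
    """p(w | state), summing over the switches l and x.

    Assumptions: lam = P(l = 1) and c = P(x = 1).
    phi[state, w]  : state-specific word probability
    topic_word[w]  : hypothetical topic word probability
    background[w]  : hypothetical background word probability
    """
    # l = 1 branch: word comes from the state (x = 1) or a topic (x = 0)
    non_background = c * phi[state, w] + (1.0 - c) * topic_word[w]
    # l = 0 branch: word comes from the background distribution
    return lam * non_background + (1.0 - lam) * background[w]

def tweet_likelihood(words, state, lam, c, phi, topic_word, background):
    """Likelihood of a tweet, treating its words as independent given the state."""
    probs = [word_likelihood(w, state, lam, c, phi, topic_word, background)
             for w in words]
    return float(np.prod(probs))
```

A per-tweet likelihood of this kind is what a forward-backward pass needs as its emission term for each candidate state.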
Forward variable: We now expand the forward variable in more detail. The initialization is as follows:
For \(1\le i\le K\):
For \(K+1\le i\le 2K\): \(A_1(i)=0\)
Induction is as follows:
For \(1\le j\le K\):
For \(K+1\le j\le 2K\):
Backward variable: The initialization for backward variable is as follows:
For \(1\le i\le 2K\):
Induction is as follows:
For \(1\le i\le K\):
For \(K+1 \le i \le 2K\):
Define z as follows:
Let \(\xi _t(i,j)\) be the probability of being in state \(S_i\) at time t, and state \(S_j\) at time \(t+1\), given O and other model parameters.
To express \(\xi _t(i,j)\), we have the following definition.
For \(1\le i\le 2K\) and \(1\le j\le K\):
For \(1\le i\le K\) and \(K+1\le j\le 2K\):
For \(K+1\le i\le 2K\) and \(K+1\le j\le 2K\):
Correspondingly, we have the following \(\xi \) values according to the different i, j value range:
Estimation of parameters:
We use the following equations to estimate the parameter values in the M-step.
For estimating \(\epsilon \):
For estimating \(\pi \):
For estimating \(\eta \):
For estimating \(\lambda \):
For estimating c:
For estimating \(\phi \):
Cite this article
Chen, L., Tozammel Hossain, K.S.M., Butler, P. et al. Syndromic surveillance of Flu on Twitter using weakly supervised temporal topic models. Data Min Knowl Disc 30, 681–710 (2016). https://doi.org/10.1007/s10618-015-0434-x
DOI: https://doi.org/10.1007/s10618-015-0434-x