Abstract
We describe a novel method for analyzing topics extracted from Twitter using the dynamic wavelet fingerprint technique (DWFT). Topics are derived from the seven tweet storms analyzed in the study using a dynamic topic model. From the time series of each topic, we run DWFT analyses to obtain a two-dimensional, time-scale, binary image. Gaussian mixture model clustering is used to identify individual objects, or storm cells, that are characteristic of specific local behaviors commonly occurring in topics. The DWFT time series transformation is volume agnostic, meaning we can compare tweet storms of different intensities. We find that we can identify behavior, localized in time, that is characteristic of how different topics propagate through Twitter. Together, dynamic topic models and the DWFT create the basis for future applications as a real-time Twitter analysis system for flagging fake news.
Notes
The free Twitter API only allows users to pull tweets from the past 7 days.
References
Abramson A (2018) Brett Kavanaugh confirmed to supreme court after fight that divided America. Time. URL http://time.com/5417538/bett-kavanaugh-confirmed-senate-supreme-court/
Afroz S, Brennan M, Greenstadt R (2012) Detecting hoaxes, frauds, and deception in writing style online. In: 2012 IEEE symposium on security and privacy, IEEE, pp 461–475
Ahmed H, Traore I, Saad S (2017) Detection of online fake news using n-gram analysis and machine learning techniques. In: International conference on intelligent, secure, and dependable systems in distributed and cloud environments, Springer, pp 127–138
Ailem M, Salah A, Nadif M (2017) Non-negative matrix factorization meets word embedding. In: Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval, ACM, pp 1081–1084
Allen S (2018) France vs. Australia 2018 world cup: Paul pogba’s late goal gives france a 2-1 win. The Washington Post. URL https://www.washingtonpost.com/news/soccer-insider/wp/2018/06/16/france-vs-australia-2018-world-cup/?utm_term=.68b0580d2f68
AlSumait L, Barbará D, Domeniconi C (2008) On-line LDA: adaptive topic models for mining text streams with applications to topic detection and tracking. In: 2008 eighth IEEE international conference on data mining, IEEE, pp 3–12
Barone M (2017) What identity politics hath wrought. National Review. URL https://www.nationalreview.com/2017/08/charlottesville-white-supremacy-antifa-identity-politics/
Baziotis C, Pelekis N, Doulkeridis C (2017) DataStories at SemEval-2017 task 4: deep LSTM with attention for message-level and topic-based sentiment analysis. In: Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017), association for computational linguistics, Vancouver, Canada, pp 747–754. https://doi.org/10.18653/v1/S17-2126. URL https://www.aclweb.org/anthology/S17-2126
Berenson T, Abramson A (2019) Mueller finds no Trump-Russia conspiracy, but stops short of exonerating Trump on obstruction: attorney general. Time. URL http://time.com/5557779/bill-barr-mueller-report-congress-conclusions/
Bertoncini CA (2010) Applications of pattern classification to time-domain signals. PhD dissertation, William and Mary, Department of Physics
Bertoncini CA, Hinders MK (2010) Fuzzy classification of roof fall predictors in microseismic monitoring. Measurement 43(10):1690–1701. https://doi.org/10.1016/j.measurement.2010.09.015. URL http://www.sciencedirect.com/science/article/pii/S0263224110002113
Bertoncini CA, Rudd K, Nousain B, Hinders M (2012) Wavelet fingerprinting of radio-frequency identification (RFID) tags. IEEE Trans Ind Electron 59(12):4843–4850. https://doi.org/10.1109/TIE.2011.2179276
Bessi A, Ferrara E (2016) Social bots distort the 2016 us presidential election online discussion. First Monday 21:11–7
Bingham J, Hinders M (2009) Lamb wave characterization of corrosion-thinning in aircraft stringers: experiment and three-dimensional simulation. J Acoust Soc Am 126(1):103–113. https://doi.org/10.1121/1.3132505
Bingham J, Hinders M, Friedman A (2009) Lamb wave detection of limpet mines on ship hulls. Ultrasonics 49(8):706–722. https://doi.org/10.1016/j.ultras.2009.05.009. URL http://www.sciencedirect.com/science/article/pii/S0041624X0900064X
Bird S, Loper E, Klein E (2009) Natural language processing with python. O’Reilly Media Inc, Sebastopol
Blei DM, Lafferty JD (2005) Correlated topic models. In: Proceedings of the 18th international conference on neural information processing systems, MIT Press, Cambridge, MA, USA, NIPS’05, pp 147–154. URL http://dl.acm.org/citation.cfm?id=2976248.2976267
Blei DM, Lafferty JD (2006) Dynamic topic models. In: Proceedings of the 23rd international conference on Machine learning, ACM, pp 113–120
Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022
Boatwright BC, Linvill DL, Warren PL (2018) Troll factories: the internet research agency and state-sponsored agenda building. Resource Centre on Media Freedom in Europe
Bourgonje P, Schneider JM, Rehm G (2017) From clickbait to fake news detection: an approach based on detecting the stance of headlines to articles. In: Proceedings of the 2017 EMNLP workshop: natural language processing meets journalism, pp 84–89
Bradshaw S, Howard P (2017) Troops, trolls and troublemakers: a global inventory of organized social media manipulation. University of Oxford, Oxford
Bradshaw S, Howard NP (2018) The global organization of social media disinformation campaigns. J Int Aff 71(1.5):23–32
Brüggermann D, Hermey Y, Orth C, Schneider D, Selzer S, Spanakis G (2016) Towards a topic discovery and tracking system with application to news items. In: International workshop on future and emerging trends in language technology, Springer, pp 183–197
Cassidy J (2019) More questions emerge about mueller’s punt on obstruction of justice. The New Yorker. URL https://www.newyorker.com/news/our-columnists/more-questions-emerge-about-muellers-punt-on-obstruction-of-justice
Cataldi M, Di Caro L, Schifanella C (2010) Emerging topic detection on twitter based on temporal and social terms evaluation. In: Proceedings of the tenth international workshop on multimedia data mining, ACM, p 4
Chen Y, Amiri H, Li Z, Chua TS (2013) Emerging topic detection for organizations from microblogs. In: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, ACM, pp 43–52
Chen Y, Conroy NJ, Rubin VL (2015a) Misleading online content: recognizing clickbait as false news. In: Proceedings of the 2015 ACM on workshop on multimodal deception detection, ACM, pp 15–19
Chen Y, Zhang H, Wu J, Wang X, Liu R, Lin M (2015b) Modeling emerging, evolving and fading topics using dynamic soft orthogonal NMF with sparse representation. In: 2015 IEEE international conference on data mining, IEEE, pp 61–70
Chen Y, Zhang H, Liu R, Ye Z (2018) Soft orthogonal non-negative matrix factorization with sparse representation: static and dynamic. Neurocomputing 310:148–164
Chen Z, Subramanian D (2018) An unsupervised approach to detect spam campaigns that use Botnets on twitter. arXiv preprint arXiv:1804.05232
Cleary G (2019) Twitterbots: anatomy of a propaganda campaign. Symantec. URL https://www.symantec.com/blogs/threat-intelligence/twitterbots-propaganda-disinformation
Cohen L (1995) Time-frequency analysis, vol 778. Prentice hall, Upper Saddle River
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: International conference on computer vision and pattern recognition (CVPR’05), IEEE Computer Society, vol 1, pp 886–893
Dale R (2017) NLP in a post-truth world. Nat Lang Eng 23(2):319–324. https://doi.org/10.1017/S1351324917000018
Daubechies I (1992) Ten lectures on wavelets, vol 61. SIAM, Philadelphia
Deisenroth M, Faisal AA, Ong CS (2019) Mathematics for machine learning. Cambridge University Press, Cambridge
Dieckman EA (2014) Use of pattern classification algorithms to interpret passive and active data streams from a walking-speed robotic sensor platform. PhD dissertation, William and Mary, Department of Applied Science
Espinoza I, Mendoza M, Ortega P, Rivera D, Weiss F (2018) Viscovery: trend tracking in opinion forums based on dynamic topic models. CoRR. arXiv:1805.00457
Freelon D, McIlwain CD, Clark MD (2016) Beyond the hashtags. Technical report, Center for Media and Social Impact. URL https://cmsimpact.org/resource/beyond-hashtags-ferguson-blacklivesmatter-online-struggle-offline-justice/
Glasser SB (2019) Donald trump went to Vietnam, and Michael Cohen made it hell. The New Yorker. URL https://www.newyorker.com/news/letter-from-trumps-washington/donald-trump-went-to-vietnam-and-michael-cohen-made-it-hell
Goff S, Wallace A (2018) Mexico delivers a world cup earthquake with defeat of Germany, the defending champ. Washington Post. URL https://www.washingtonpost.com/news/soccer-insider/wp/2018/06/17/germany-vs-mexico-2018-world-cup/?utm_term=.8069e718d68d
Green E (2017) Why Charlottesville marchers were obsessed with jews. The Atlantic. URL https://www.theatlantic.com/politics/archive/2017/08/nazis-racism-charlottesville/536928/
Guille A, Hacid H, Favre C, Zighed DA (2013) Information diffusion in online social networks: a survey. SIGMOD Rec 42(2):17–28. https://doi.org/10.1145/2503792.2503797
Gunia A (2019) President Trump dismisses Michael Cohen’s testimony as ‘95% lies’ during post-summit press conference. Time. URL http://time.com/5540817/donald-trump-reaction-michael-cohen-testimony/
Hamidian S, Diab MT (2016) Rumor identification and belief investigation on twitter. In: Proceedings of NAACL-HLT, pp 3–8
Hanson VD (2018) Kavanaugh casualties. National Review. URL https://www.nationalreview.com/2018/10/kavanaugh-confirmation-fight-casualties-left-never-trumpers-metoo/
He X, Cai D, Niyogi P (2005) Laplacian score for feature selection. In: Proceedings of the 18th international conference on neural information processing systems, MIT Press, Cambridge, MA, USA, NIPS’05, pp 507–514. URL http://dl.acm.org/citation.cfm?id=2976248.2976312
Heim J (2017) Recounting a day of rage, hate, violence, and death. Washington Post. URL https://www.washingtonpost.com/graphics/2017/local/charlottesville-timeline/?utm_term=.d673b9006a90
Hinders M, Bingham J, Rudd K, Jones R, Leonard K (2006) Wavelet thumbprint analysis of time domain reflectometry signals for wiring flaw detection. In: Thompson DO, Chimenti DE (eds) Review of progress in quantitative nondestructive evaluation, vol 25, American Institute of Physics Conference Series, vol 820, pp 641–648. https://doi.org/10.1063/1.2184587
Hou J, Hinders MK (2002) Dynamic wavelet fingerprint identification of ultrasound signals. Mater Eval 60(9):1089–1093
Hou J, Leonard KR, Hinders MK (2004) Automatic multi-mode lamb wave arrival time extraction for improved tomographic reconstruction. Inverse Probl 20(6):1873–1888. https://doi.org/10.1088/0266-5611/20/6/012
Hou J, Rose ST, Hinders MK (2005) Ultrasonic periodontal probing based on the dynamic wavelet fingerprint. EURASIP J Adv Signal Process 2005:1137–1146. https://doi.org/10.1155/ASP.2005.1137
Jain A, Chandrasekaran B (1982) 39 dimensionality and sample size considerations in pattern recognition practice. In: Classification pattern recognition and reduction of dimensionality, handbook of statistics, vol 2, Elsevier, pp 835–855. https://doi.org/10.1016/S0169-7161(82)02042-2. URL http://www.sciencedirect.com/science/article/pii/S0169716182020422
Jin F, Dougherty E, Saraf P, Cao Y, Ramakrishnan N (2013) Epidemiological modeling of news and rumors on twitter. In: Proceedings of the 7th workshop on social network mining and analysis, ACM, p 8
Jin Z, Cao J, Zhang Y, Luo J (2016) News verification by exploiting conflicting social viewpoints in microblogs. In: Proceedings of the thirtieth AAAI conference on artificial intelligence, AAAI Press, AAAI’16, pp 2972–2978. URL http://dl.acm.org/citation.cfm?id=3016100.3016318
Jin Z, Cao J, Zhang Y, Zhou J, Tian Q (2017) Novel visual and statistical image features for microblogs news verification. Trans Multi 19(3):598–608. https://doi.org/10.1109/TMM.2016.2617078
Jähnichen P, Wenzel F, Kloft M, Mandt S (2018) Scalable generalized dynamic topic models. In: Storkey A, Perez-Cruz F (eds) Proceedings of the twenty-first international conference on artificial intelligence and statistics, PMLR, Playa Blanca, Lanzarote, Canary Islands. Proceedings of machine learning research, vol 84, pp 1427–1435. URL http://proceedings.mlr.press/v84/jahnichen18a.html
Kwon S, Cha M, Jung K, Chen W, Wang Y (2013) Prominent features of rumor propagation in online social media. In: 2013 IEEE 13th international conference on data mining, pp 1103–1108. https://doi.org/10.1109/ICDM.2013.61
Kwon S, Cha M, Jung K (2017) Rumor detection over varying time windows. PLoS ONE 12(1):e0168344. https://doi.org/10.1371/journal.pone.0168344. URL http://europepmc.org/articles/PMC5230768
Lafferty JD, McCallum A, Pereira FCN (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the eighteenth international conference on machine learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML ’01, pp 282–289. URL http://dl.acm.org/citation.cfm?id=645530.655813
Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: Xing EP, Jebara T (eds) Proceedings of the 31st international conference on machine learning, PMLR, Beijing, China. Proceedings of machine learning research, vol 32, pp 1188–1196. URL http://proceedings.mlr.press/v32/le14.html
Leskovec J, Backstrom L, Kleinberg J (2009) Meme-tracking and the dynamics of the news cycle. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’09, pp 497–506. https://doi.org/10.1145/1557019.1557077
Linvill DL, Boatwright BC, Grant WJ, Warren PL (2019) “The Russians are hacking my brain!” investigating Russia’s internet research agency twitter tactics during the 2016 united states presidential campaign. Comput Hum Behav 99:292–300. https://doi.org/10.1016/j.chb.2019.05.027
Littman J (2018a) Charlottesville 2018 tweet ids. Technical report, Harvard Dataverse. https://doi.org/10.7910/DVN/DVLJTO
Littman J (2018b) Winter Olympics 2018 tweet ids. Technical report, Harvard Dataverse. https://doi.org/10.7910/DVN/YMJPFC
Liu Y, Liu Z, Chua TS, Sun M (2015) Topical word embeddings. In: Proceedings of the twenty-ninth AAAI conference on artificial intelligence, AAAI Press, AAAI’15, pp 2418–2424. URL http://dl.acm.org/citation.cfm?id=2886521.2886657
Ma J, Gao W, Wei Z, Lu Y, Wong KF (2015) Detect rumors using time series of social context information on microblogging websites. In: Proceedings of the 24th ACM international on conference on information and knowledge management, ACM, New York, NY, USA, CIKM ’15, pp 1751–1754. https://doi.org/10.1145/2806416.2806607
MacGillis A (2019) The tragedy of Baltimore. ProPublica. URL https://www.propublica.org/article/the-tragedy-of-Baltimore
Mathis-Lilley B (2015) Freddie gray died of single “high-energy” injury, leaked autopsy report says. Slate. URL https://www.propublica.org/article/the-tragedy-of-Baltimore
Matsubara Y, Sakurai Y, Prakash BA, Li L, Faloutsos C (2012) Rise and fall patterns of information diffusion: model and implications. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’12, pp 6–14. https://doi.org/10.1145/2339530.2339537
McCarthy A (2016) Freddie gray case: the war on cops... in the courtroom. Nat Rev. URL https://www.nationalreview.com/corner/war-cops-freddy-gray-another-acquittal-brian-rice/
Mehrotra R, Sanner S, Buntine W, Xie L (2013) Improving lda topic models for microblogs via tweet pooling and automatic labeling. In: Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval, ACM, New York, NY, USA, SIGIR ’13, pp 889–892. https://doi.org/10.1145/2484028.2484166
Miller CA, Hinders MK (2014) Classification of flaw severity using pattern recognition for guided wave-based structural health monitoring. Ultrasonics 54(1):247–258. https://doi.org/10.1016/j.ultras.2013.04.020. URL http://www.sciencedirect.com/science/article/pii/S0041624X13001406
Mimno D, Wallach HM, Talley E, Leenders M, McCallum A (2011) Optimizing semantic coherence in topic models. In: Proceedings of the conference on empirical methods in natural language processing, Association for Computational Linguistics, Stroudsburg, PA, USA, EMNLP ’11, pp 262–272. URL http://dl.acm.org/citation.cfm?id=2145432.2145462
Mueller III R (2019) Report on the investigation into Russian interference in the 2016 presidential election, vol I of II. United States Department of Justice
Niu L, Dai X, Zhang J, Chen J (2015) Topic2vec: learning distributed representations of topics. In: 2015 International conference on Asian language processing (IALP), pp 193–196. https://doi.org/10.1109/IALP.2015.7451564
Oliphant TE (2006) A guide to NumPy. Trelgol Publishing, New York
Panisson A, Gauvin L, Quaggiotto M, Cattuto C (2014) Mining concurrent topical activity in microblog streams. arXiv e-prints arXiv:1403.1403
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830. URL http://dl.acm.org/citation.cfm?id=1953048.2078195
Potthast M, Kiesel J, Reinartz K, Bevendorff J, Stein B (2018) A stylometric inquiry into hyperpartisan and fake news. In: Proceedings of the 56th annual meeting of the association for computational linguistics, vol 1: Long Papers, Association for Computational Linguistics, Melbourne, Australia, pp 231–240. URL https://www.aclweb.org/anthology/P18-1022
Pratt W (2013) Introduction to digital image processing. CRC Press, Boca Raton
Qazvinian V, Rosengren E, Radev DR, Mei Q (2011) Rumor has it: Identifying misinformation in microblogs. In: Proceedings of the conference on empirical methods in natural language processing, Association for Computational Linguistics, Stroudsburg, PA, USA, EMNLP ’11, pp 1589–1599. URL http://dl.acm.org/citation.cfm?id=2145432.2145602
Raleigh H (2019) Trump was right to walk away from kim jung-un’s bad deal. The Federalist. URL https://thefederalist.com/2019/03/04/trump-right-walk-away-bad-deal-still-hope-denuclearization/
Ross K (2019) The false dichotomy of the mueller report. Washington Examiner. URL https://www.washingtonexaminer.com/opinion/the-false-dichotomy-of-the-mueller-report
Rubin VL, Lukoianova T (2015) Truth and deception at the rhetorical structure level. J Assoc Inf Sci Technol 66(5):905–917. https://doi.org/10.1002/asi.23216
Saha A, Sindhwani V (2012) Learning evolving and emerging topics in social media: a dynamic nmf approach with temporal regularization. In: Proceedings of the fifth ACM international conference on web search and data mining, ACM, New York, NY, USA, WSDM ’12, pp 693–702. https://doi.org/10.1145/2124295.2124376
Shao C, Ciampaglia GL, Varol O, Flammini A, Menczer F (2017) The spread of fake news by social bots, pp 96–104. arXiv preprint arXiv:1707.07592
Shu K, Sliva A, Wang S, Tang J, Liu H (2017) Fake news detection on social media: a data mining perspective. SIGKDD Explor Newsl 19(1):22–36. https://doi.org/10.1145/3137597.3137600
Skinner E, Kirn S, Hinders M (2019) Development of underwater beacon for arctic through-ice communication via satellite. Cold Reg Sci Technol 160:58–79. https://doi.org/10.1016/j.coldregions.2019.01.010. URL http://www.sciencedirect.com/science/article/pii/S0165232X18302787
Sorkin AD (2018) Brett Kavanaugh and the G.O.P.’s bargain with Trump. The New Yorker. URL https://www.newyorker.com/magazine/2018/10/15/brett-kavanaugh-and-the-gops-bargain-with-trump
Toloşi L, Tagarev A, Georgiev G (2016) An analysis of event-agnostic features for rumour classification in twitter. In: Tenth international AAAI conference on web and social media
Tweepy (2017) Streaming with tweepy–tweepy 3.5.0 documentation. URL http://tweepy.readthedocs.io/en/v3.5.0/streaming_how_to.html
van der Walt S, Schönberger JL, Nunez-Iglesias J, Boulogne F, Warner JD, Yager N, Gouillart E, Ta Yu (2014) scikit-image: image processing in python. PeerJ 2:e453. https://doi.org/10.7717/peerj.453
Wallace-Wells B (2016) Baltimore and the future of protest politics. The New Yorker. URL https://www.newyorker.com/news/benjamin-wallace-wells/baltimore-and-the-future-of-protest-politics
Wang WY (2017) “Liar, liar pants on fire”: a new benchmark dataset for fake news detection. In: Proceedings of the 55th annual meeting of the association for computational linguistics (vol 2: Short Papers), Association for Computational Linguistics, Vancouver, Canada, pp 422–426. https://doi.org/10.18653/v1/P17-2067. URL https://www.aclweb.org/anthology/P17-2067
Webb H, Burnap P, Procter R, Rana O, Stahl BC, Williams M, Housley W, Edwards A, Jirotka M (2016) Digital wildfires: propagation, verification, regulation, and responsible innovation. ACM Trans Inf Syst 34(3):15:1–15:23. https://doi.org/10.1145/2893478
Wit E, Heuvel Evd, Romeijn JW (2012) ‘All models are wrong...’: an introduction to model uncertainty. Stat Neerl 66(3):217–236. https://doi.org/10.1111/j.1467-9574.2012.00530.x
Wu K, Yang S, Zhu KQ (2015) False rumors detection on Sina Weibo by propagation structures. In: 2015 IEEE 31st international conference on data engineering, pp 651–662. https://doi.org/10.1109/ICDE.2015.7113322
Xu W, Liu X, Gong Y (2003) Document clustering based on non-negative matrix factorization. In: Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval, ACM, New York, NY, USA, SIGIR ’03, pp 267–273. https://doi.org/10.1145/860435.860485
Xun G, Li Y, Gao J, Zhang A (2017) Collaboratively improving topic discovery and word embeddings by coordinating global and local contexts. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’17, pp 535–543. https://doi.org/10.1145/3097983.3098009
Zhao WX, Jiang J, Weng J, He J, Lim EP, Yan H, Li X (2011) Comparing twitter and traditional media using topic models. In: Clough P, Foley C, Gurrin C, Jones GJF, Kraaij W, Lee H, Murdock V (eds) Advances in information retrieval. Springer, Heidelberg, pp 338–349
Zhao Z, Resnick P, Mei Q (2015) Enquiring minds: early detection of rumors in social media from enquiry posts. In: Proceedings of the 24th international conference on world wide web, international world wide web conferences, Steering Committee, Republic and Canton of Geneva, Switzerland, WWW ’15, pp 1395–1405. https://doi.org/10.1145/2736277.2741637
Zhu P, Zuo W, Zhang L, Hu Q, Shiu SC (2015) Unsupervised feature selection by regularized self-representation. Pattern Recogn 48(2):438–446. https://doi.org/10.1016/j.patcog.2014.08.006
Zubiaga A, Liakata M, Procter R (2016a) Learning reporting dynamics during breaking news for rumour detection in social media. CoRR. arXiv:1610.07363
Zubiaga A, Liakata M, Procter R, Wong Sak Hoi G, Tolmie P (2016b) Analysing how people orient to and spread rumours in social media by looking at conversational threads. PLoS ONE 11(3):1–29. https://doi.org/10.1371/journal.pone.0150989
Zubiaga A, Liakata M, Procter R (2017) Exploiting context for rumour detection in social media. In: Ciampaglia GL, Mashhadi A, Yasseri T (eds) Social informatics. Springer, Cham, pp 109–123
Acknowledgements
We would like to acknowledge Dr. William Fehlman for innumerable conversations about topic modeling and its many applications, particularly to Twitter data. This work was performed in part using computing facilities at the College of William and Mary which were provided by contributions from the National Science Foundation, the Commonwealth of Virginia Equipment Trust Fund and the Office of Naval Research.
Appendices
Appendix: Data sets
1.1 Brett Kavanaugh confirmation hearings
The confirmation hearings of Supreme Court Justice Brett Kavanaugh were a highly partisan and contentious affair. Shortly after President Donald Trump tapped Kavanaugh as his choice to fill an open seat on the Supreme Court, many people, particularly on the political left, were angry. Kavanaugh, known as a conservative judge, was replacing Justice Anthony Kennedy, who had been a swing vote for most of his 30 years on the court. During the confirmation hearings, an accusation of sexual assault from Kavanaugh’s time in high school emerged. The accusation from Dr. Christine Blasey Ford was impossible to verify as the alleged assault occurred in the 1980s (Abramson 2018; Hanson 2018; Sorkin 2018). Against the background of the #MeToo movement, a Twitter firestorm emerged. Many on the left, most of whom would have already been against Kavanaugh, argued that he was unfit to be on the Supreme Court due to these allegations as well as his demeanor in the hearings. Many on the right argued that there was no concrete evidence proving the allegations, so Kavanaugh should not be treated as if he were guilty.
Figure 16 shows the time series of tweets through the three days of Kavanaugh’s confirmation hearings in the Senate. Three key spikes occur in the time series, labeled A, B, and C. The first two occur on October 5th in quick succession. Spike A marks the Senate passing the procedural vote to proceed with the confirmation process. Three hours later, Senator Susan Collins of Maine, seen as a swing vote in the confirmation process, announced that she would vote in favor of confirming Kavanaugh (spike B). Finally, the largest spike, labeled C, occurs on October 6th, when the Senate officially confirmed Kavanaugh to fill Anthony Kennedy’s seat on the Supreme Court.
It is important to note that this dataset filtered out all retweets. The data were collected using the Python Tweepy module by streaming tweets that included the terms ‘Kavanaugh’ and ‘Supreme Court’ (Tweepy 2017).
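As a rough illustration of the retweet filtering mentioned above, the following minimal sketch drops retweets from tweets that have already been collected. It assumes each tweet is available as a dictionary in the Twitter v1.1 JSON layout, where a retweet carries a `retweeted_status` field. The helper names `is_retweet` and `drop_retweets`, the sample records, and the fallback check on the `RT @` prefix are illustrative assumptions, not the procedure used in the study.

```python
def is_retweet(tweet):
    """Return True if a tweet dictionary looks like a retweet.

    Assumes the Twitter v1.1 JSON layout, in which retweets carry a
    'retweeted_status' object. The 'RT @' prefix check is a fallback
    heuristic (an assumption) for records where that field is missing.
    """
    if "retweeted_status" in tweet:
        return True
    return tweet.get("text", "").startswith("RT @")


def drop_retweets(tweets):
    """Keep only original tweets (hypothetical helper)."""
    return [t for t in tweets if not is_retweet(t)]


# Made-up sample records for illustration only.
sample = [
    {"id": 1, "text": "Senate votes to confirm Kavanaugh"},
    {"id": 2, "text": "RT @user: Senate votes to confirm Kavanaugh",
     "retweeted_status": {"id": 1}},
    {"id": 3, "text": "RT @other: Supreme Court news"},
]
originals = drop_retweets(sample)
print([t["id"] for t in originals])  # [1]
```

Filtering on the structured field first and the text prefix second keeps the check cheap while still catching records whose metadata was stripped before storage.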
1.2 Freddie Gray riots
On April 19, 2015, Freddie Gray died from spinal injuries sustained while in Baltimore police custody. Gray suffered spinal cord injuries during a ride in a police van in which he was handcuffed but not buckled in, and thus unable to brace himself when he fell during rough portions of the ride (Mathis-Lilley 2015; McCarthy 2016). There were questions about why the Baltimore police arrested him as well as about his treatment while in police custody. The questionable arrest coupled with the apparent mistreatment of Gray while in custody was enough to cause an uproar, and in the context of other recent police killings of people of color, such as Tamir Rice and Eric Garner, it led to riots in the streets of Baltimore and protests across the nation (MacGillis 2019; Wallace-Wells 2016). This reignited ongoing feuds between groups who either defended the victims as being wrongfully treated or defended the police, claiming they were trying to do their job. Much of this played out through protests, but it can also be seen playing out on Twitter. American University’s Center for Media and Social Impact (CMSI) did a comprehensive study on hashtag activism, in particular hashtag movements that centered around Black Lives Matter in 2014 and 2015 (Freelon et al. 2016). CMSI published their dataset, which is what we use in this study. We specifically focus on the tweets included in the CMSI dataset from April 25, 2015 until April 30, 2015, with time steps of 5 min. To ensure we only included tweets relevant to Freddie Gray, we filtered the dataset using the terms: Freddie, Gray, #Baltimore, #Baltimoreriots, #Baltimoreuprising, #freddiegray, and Baltimore. This leads to a total of about 811,000 tweets over the 5-day span, shown in Fig. 17.
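The term filtering and 5 min time steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: tweets are assumed to be available as (timestamp, text) pairs, matching is assumed to be a case-insensitive substring test, and the names `FILTER_TERMS`, `is_relevant`, and `bin_counts` as well as the sample tweets are hypothetical.

```python
import math
from collections import Counter
from datetime import datetime, timedelta

# Filter terms listed in the text, lowercased. Case-insensitive substring
# matching is an assumption about how the original filter behaved.
FILTER_TERMS = ["freddie", "gray", "#baltimore", "#baltimoreriots",
                "#baltimoreuprising", "#freddiegray", "baltimore"]


def is_relevant(text, terms=FILTER_TERMS):
    """Return True if the tweet text contains any filter term."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def bin_counts(tweets, start, end, step=timedelta(minutes=5)):
    """Count relevant tweets per time step over [start, end).

    `tweets` is an iterable of (timestamp, text) pairs. Returns one count
    per step, with zeros for steps that contain no relevant tweets.
    """
    n_bins = math.ceil((end - start) / step)
    counts = Counter()
    for timestamp, text in tweets:
        if start <= timestamp < end and is_relevant(text):
            counts[(timestamp - start) // step] += 1
    return [counts[i] for i in range(n_bins)]


# Made-up sample tweets for illustration only.
start = datetime(2015, 4, 25)
end = start + timedelta(minutes=15)
tweets = [
    (start + timedelta(minutes=1), "Justice for Freddie Gray"),
    (start + timedelta(minutes=4), "Nice weather today"),
    (start + timedelta(minutes=6), "#BaltimoreUprising tonight"),
    (start + timedelta(minutes=7), "Protests in Baltimore"),
    (start + timedelta(minutes=20), "Baltimore after the window"),
]
counts = bin_counts(tweets, start, end)
print(counts)  # [1, 2, 0]
```

Substring matching is deliberately permissive; a term such as "gray" will also match unrelated words, so a stricter tokenized match may be preferable in practice.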
1.3 Michael Cohen testimony and North Korea summit
On February 27, 2019, two major events in American politics coincided. Beginning in the late morning, Michael Cohen, President Donald Trump’s former personal lawyer, testified to the House Oversight and Reform Committee about his work with Trump before and during the election. Much of this testimony included accusations about criminal and unethical activity by Donald Trump and his family. Later in the afternoon, President Trump had a summit with Kim Jong Un, the Supreme Leader of North Korea. The summit ended with President Trump walking away from the table with no deal between the two nations, and with vastly different interpretations of the day’s events depending on one’s political leanings (Glasser 2019; Gunia 2019; Raleigh 2019).
A dataset consisting of roughly 5 million tweets over the 24 h period in which both of these events took place was streamed using the Python module Tweepy (2017) and filtered using the terms: Trump, Cohen, North Korea, Hanoi, and Kim. It is clear that the time series in Fig. 18 does not have the same rhythmic nature as those in Figs. 16 and 17. There are two reasons for this. First, those datasets span multiple days; thus, the natural rhythm of night and day affects tweet volume. Second, we added the term ‘Trump’ to the query items, which is too general and caused the streamer to be rate limited to roughly 15,000 tweets per 5 min. Twitter does not publish the actual rate limit, but the flat line right around 15,000 over most of the time series is a good indicator that we were capped at this volume.
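One simple way to check for such a cap is to look for a long run of time steps whose counts sit just below the observed maximum. The sketch below is only a heuristic under stated assumptions: the function name `longest_plateau`, the 5% tolerance, and the sample counts are all hypothetical, and the study itself identified the plateau by inspecting the time series.

```python
def longest_plateau(counts, tolerance=0.05):
    """Find the longest run of bins within `tolerance` of the maximum.

    Returns (start_index, length, cap_estimate). A long run close to the
    maximum suggests the stream was capped rather than peaking naturally.
    """
    if not counts:
        return (0, 0, 0)
    cap = max(counts)
    threshold = cap * (1 - tolerance)
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for i, c in enumerate(counts):
        if c >= threshold:
            if run_len == 0:
                run_start = i  # a new run begins at this bin
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0  # the run is broken
    return (best_start, best_len, cap)


# Made-up 5 min tweet counts for illustration only.
counts = [4000, 9000, 14900, 15000, 14950, 15010, 14980, 12000, 15000]
print(longest_plateau(counts))  # (2, 5, 15010)
```

A natural peak usually touches its maximum for only one or two bins, so a run covering most of the series, as described for Fig. 18, points to an external limit.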
1.4 Winter Olympics
The Winter Olympics were held in February of 2018 in Pyeongchang, South Korea. The games featured many stars in their respective sports, both established and new, who captivated the audience. Harvard Dataverse has published a dataset of about 13 million tweets that contained one of the hashtags: #olympics, #pyeongchang2018, #winterolympics, or the Korean hashtag which translates to “Pyeongchang Winter Olympics” (Littman 2018b).
Figure 19 shows the time series of tweet volume over the 28-day period. There is periodicity in the tweet volume, likely because the Olympics are aired on tape delay during prime time (8 p.m.–11 p.m. Eastern) every night in the US. The tweet volume begins to spike on February 9th, which coincides with the opening ceremonies. After that there are regular spikes every day until the closing ceremonies on February 25th.
1.5 Charlottesville riots
On August 11 and 12 of 2017, a white supremacist group held their Unite the Right rally in Charlottesville, VA. The rally was met with thousands of protesters in the streets of Charlottesville, culminating in the killing of a protester named Heather Heyer (Barone 2017; Green 2017; Heim 2017).
Harvard Dataverse published a set of Tweet IDs about this event (Littman 2018a). They searched Twitter for tweets containing the hashtags: #charlottesville, #standwithcharlottesville, #defendCville, #HeatherHeyer, or #UnityCville. Altogether, the search produced about 3.5 million tweets, shown as a time series in Fig. 20. The dataset spans the few days leading up to the rally until the day after. Three peaks are seen in the data. The first, labeled A, occurs on the evening of August 11, when the rally goers marched through the University of Virginia campus carrying tiki torches. The next peak, labeled B, corresponds to the march and ensuing riots throughout the day of August 12. Finally, the last peak, labeled C, comes from residual fallout of the day and ongoing discussions on social media about the events that occurred.
1.6 Mueller report
After the 2016 U.S. presidential election, Department of Justice Deputy Attorney General Rod Rosenstein appointed Robert Mueller as Special Counsel to investigate any collusion the Trump campaign might have had with Russia to aid in Trump’s election. The Special Counsel also looked into the possibility that Donald Trump obstructed justice by attempting to derail the investigation (Berenson and Abramson 2019; Cassidy 2019; Ross 2019). After almost two years of work, Mueller submitted his report to Attorney General William Barr on March 22, 2019 (Mueller III 2019). Given the great interest in the Mueller investigation over two years and the divisive nature of the Trump administration, this set off a major tweet storm. Many on the left were hoping this would lead to an indictment of President Trump and cause the Democrat-led House of Representatives to begin impeachment proceedings, while many on the right were hoping to see President Trump exonerated of all charges.
Figure 21 shows the time series of the tweet volume over the weekend the Mueller Report was submitted.
On the afternoon of March 22nd, it was announced that Mueller had turned his report over to William Barr. The findings of the report were not made public until Barr sent a letter to Congress on the afternoon of March 24th summarizing what Mueller had put in his report. The actual report Mueller submitted to Barr was not released to the public during the period covered by the dataset. In his letter, Barr said that Mueller had exonerated President Trump of any collusion with Russia, but that there was not enough evidence to either indict or exonerate President Trump on obstruction of justice. In Fig. 21 the data began streaming almost as soon as it was announced that Mueller had submitted his report to Barr. After this there is the normal variation of tweet volume until Barr sent his letter to Congress. The sharp dip in the data shortly after noon on March 23rd occurred because the tweet stream needed to be reset. There is a sharp spike and plateau around the evening of March 24th, when Barr submitted his letter. The plateau is likely due to rate limits on Twitter's API.
1.7 World Cup
Soccer is the most popular sport in the world, and the World Cup brings the greatest soccer players into one competition to play for national pride. We wanted to analyze the volume of English language tweets about the first few days of the Group Stage of the 2018 World Cup. Within this time frame there were some memorable games, such as Mexico's surprising victory over the defending World Cup champions Germany, and France's late victory over the underdog Australia (Allen 2018; Goff and Wallace 2018). Figure 22 shows the time series of tweet volume for this period. In total we were able to stream 15,936 tweets. All tweets were gathered using Twitter's API and Tweepy (2017). The low volume is likely due to the absence of the United States, who failed to qualify. Because we filtered for English tweets only, most of our data came from the United States, where there was less interest in the competition. However, this lower volume will lead to an interesting comparison of tweet storms based on overall tweet volume.
DTM details
1.1 U,V initialization
The U and V matrices are initialized following the method used by Saha and Sindhwani (2012). The U(t) matrix is initialized as
\[U(t) = \begin{bmatrix} U(t-1) & U_\mathrm{emerge} \end{bmatrix},\]
where \(U(t-1)\) is the set of all non-faded topics from the previous time step, and \(U_\mathrm{emerge}\) is an \(M\times k_\mathrm{emerge}\) matrix with random, non-negative entries, where \(k_\mathrm{emerge}\) is a parameter that sets the number of topics to add each time step. V(t) is initialized as
\[V(t) = \begin{bmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{bmatrix},\]
where \(V_{11}\) represents old documents and old topics, \(V_{12}\) is new documents and old topics, \(V_{21}\) is old documents and new topics, and \(V_{22}\) is new documents and new topics. Both \(V_{12}\) and \(V_{22}\) are randomly initialized, as we have no assumption about which topics the new documents will contain. \(V_{11}\) is initialized as \(V(t-1)\), and \(V_{21}\) is initialized to all 0s, because we have already derived the topic distribution for those documents. On the first time step the model is run, both U and V are initialized entirely at random. All random initializations are normal distributions of non-negative numbers with the mean at the average value of D(w) divided by the total number of topics. This method of random initialization is used in the NMF implementation in Pedregosa et al. (2011).
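The block structure above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, `V_prev` is assumed to already be restricted to documents still inside the window, and the random scale (absolute values of normal draws scaled by the square root of the data mean over the number of topics) follows scikit-learn's random NMF initialization as an assumption about the intended scheme.

```python
import numpy as np


def random_nonnegative(shape, data_mean, n_topics, rng):
    """Return a non-negative random block.

    Assumption: |N(0, 1)| draws scaled by sqrt(data_mean / n_topics),
    mirroring scikit-learn's random NMF initialization.
    """
    scale = np.sqrt(data_mean / n_topics)
    return np.abs(scale * rng.standard_normal(shape))


def initialize_factors(U_prev, V_prev, n_new_docs, k_emerge, data_mean, rng):
    """Build U(t) and V(t) from the previous factors (hypothetical helper).

    U_prev: (M, k_old) term-topic matrix of non-faded topics.
    V_prev: (k_old, n_old_docs) topic-document matrix.
    """
    n_terms, k_old = U_prev.shape
    n_old_docs = V_prev.shape[1]
    k_total = k_old + k_emerge

    # U(t) = [U(t-1), U_emerge]: new topics are appended as columns.
    U_emerge = random_nonnegative((n_terms, k_emerge), data_mean, k_total, rng)
    U = np.hstack([U_prev, U_emerge])

    V11 = V_prev                                   # old documents, old topics
    V12 = random_nonnegative((k_old, n_new_docs), data_mean, k_total, rng)
    V21 = np.zeros((k_emerge, n_old_docs))         # old documents, new topics
    V22 = random_nonnegative((k_emerge, n_new_docs), data_mean, k_total, rng)
    V = np.block([[V11, V12], [V21, V22]])
    return U, V
```

With `U_prev` of shape (M, k) and `V_prev` of shape (k, N), the result has shapes (M, k + k_emerge) and (k + k_emerge, N + n_new_docs), so the product U V still approximates a term-by-document matrix.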
1.2 Topic streams
Topics are tracked through time using topic streams. There are two topic streams: the evolving topic stream and the faded topic stream. Each entry in a stream contains information about a specific topic: its terms, weights, coherence, the number of tweets mentioning the topic, the time stamp at which the topic began, and the time stamp at which the topic faded. The number of tweets entry gives the raw number of tweets with a nonzero entry for that topic in the V matrix, which is what is used for time series representations of topics. Coherence is calculated using Eq. (26).
1.3 Removing fading topics
Before updating the model, all fading topics need to be eliminated from U and V. A fading topic is defined to be a topic that is no longer representative of the documents inside D(w). A topic is no longer representative of D(w) when fewer than some predefined percentage of tweets mention that topic, in our case 0.5% of the tweets in D(w). Once a topic is said to have faded, its entry in the evolving topic stream is moved to the faded topic stream and the corresponding column in U and row in V are removed.
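The fading rule can be sketched as follows, assuming that a tweet "mentions" a topic when its entry in V is nonzero (as in the topic stream description) and that V is laid out as topics by documents. The function name `remove_faded_topics` is hypothetical.

```python
import numpy as np


def remove_faded_topics(U, V, threshold=0.005):
    """Drop topics mentioned by fewer than `threshold` of the documents.

    U: (n_terms, k) term-topic matrix; V: (k, n_docs) topic-document matrix.
    Returns the reduced U and V plus the indices of the faded topics, so the
    caller can move those entries to the faded topic stream.
    """
    n_docs = V.shape[1]
    # Fraction of documents with a nonzero weight for each topic.
    mention_fraction = np.count_nonzero(V, axis=1) / n_docs
    keep = mention_fraction >= threshold
    faded_ids = np.flatnonzero(~keep)
    # Remove the faded topics' columns from U and rows from V.
    return U[:, keep], V[keep, :], faded_ids
```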
1.4 Checking emerging topics
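Faded topics are saved so they can be compared to future emerging topics. A topic that has faded can become an evolving topic again if it is similar enough to an emerging topic. This is referred to as a reemerging topic (Brüggermann et al. 2016). Cosine similarity is one method to measure the similarity between the two topics,
\[\mathrm{sim}(v_\mathrm{faded}, v_\mathrm{emerging}) = \frac{v_\mathrm{faded} \cdot v_\mathrm{emerging}}{\Vert v_\mathrm{faded}\Vert \, \Vert v_\mathrm{emerging}\Vert},\]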
where \(v_\mathrm{faded}\) is the term-topic vector for the faded topic and \(v_\mathrm{emerging}\) is the term-topic vector for the emerging topic. However, the vocabulary in the model is updated over time, and terms no longer in use are dropped to save memory. This means that topic vectors from one point in time cannot be compared to topic vectors from another, because the entries will correspond to different vocabulary terms. To combat this we save the top n terms from the topic vector at each time step, where n is usually 10. Topic similarity is then calculated by comparing the top terms of two different topics. If they share enough words (e.g., 8 of the top 10 are the same), they are considered the same topic and the faded topic is reclassified as a reemerged topic.
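Both comparisons are short enough to sketch directly. The helper names below are hypothetical, and the defaults (n = 10, at least 8 shared terms) are the example values quoted in the text rather than fixed parameters of the model.

```python
import numpy as np


def cosine_similarity(v_faded, v_emerging):
    """Cosine similarity of two topic vectors over the SAME vocabulary."""
    v1 = np.asarray(v_faded, dtype=float)
    v2 = np.asarray(v_emerging, dtype=float)
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / norm) if norm > 0 else 0.0


def top_terms(topic_vector, vocabulary, n=10):
    """Return the n highest-weighted terms of a topic, heaviest first."""
    order = np.argsort(topic_vector)[::-1][:n]
    return [vocabulary[i] for i in order]


def is_reemerging(faded_terms, emerging_terms, min_shared=8):
    """Vocabulary-independent check: do the top-term lists overlap enough?"""
    return len(set(faded_terms) & set(emerging_terms)) >= min_shared
```

Storing only the top terms keeps the comparison valid after the vocabulary changes, at the cost of ignoring the term weights.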
Feature extraction
DWFT creates a black and white image with objects that look similar to human fingerprints. Humans are adept at finding patterns in images, so utilizing wavelet fingerprints of time series allows us to use our own acuity to identify where patterns are and which features might be important for identifying those patterns. Before feature extraction we need to identify each individual object in the fingerprint. Figure 3 shows the feature extraction process. Each shade of gray at the bottom left of Fig. 3 represents a different object. Identifying objects becomes important when extracting features for analysis. To do this we use 8-connectivity (Bertoncini 2010), which identifies groups of nonzero pixels touching each other at any point, including diagonally, and gives them a common label. Because the inner ridges of a fingerprint do not always touch the outer ridges, ridges that represent the same object can be labeled as two different objects. Thus, we check each object in a fingerprint to ensure it is not surrounded by another object. If it is, both objects are relabeled to be the same.
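A minimal sketch of this labeling step using `scipy.ndimage.label` with a 3 by 3 structuring element of ones, which gives 8-connectivity. The "surrounded" test is an assumption: here an object counts as surrounded when its bounding box lies entirely inside the larger bounding box of another object, which also handles arch-shaped ridges that are open along the image edge. The function name is hypothetical.

```python
import numpy as np
from scipy import ndimage


def label_fingerprint_objects(image):
    """Label 8-connected objects and merge objects nested inside others."""
    binary = np.asarray(image) > 0
    eight_connected = np.ones((3, 3), dtype=int)
    labels, n_objects = ndimage.label(binary, structure=eight_connected)
    boxes = ndimage.find_objects(labels)  # boxes[i] belongs to label i + 1

    def area(box):
        return (box[0].stop - box[0].start) * (box[1].stop - box[1].start)

    def contains(outer, inner):
        return all(o.start <= i.start and i.stop <= o.stop
                   for o, i in zip(outer, inner))

    merged = labels.copy()
    for inner_id, inner_box in enumerate(boxes, start=1):
        # Candidate parents: strictly larger boxes that enclose this one.
        enclosing = [(area(box), outer_id)
                     for outer_id, box in enumerate(boxes, start=1)
                     if outer_id != inner_id
                     and area(box) > area(inner_box)
                     and contains(box, inner_box)]
        if enclosing:
            # Relabel to the outermost enclosing object (largest box).
            merged[labels == inner_id] = max(enclosing)[1]
    return merged
```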
Feature extraction from wavelet fingerprints follows Bertoncini (2010) and Dieckman (2014). Let I(a, b) represent the binary image matrix for a wavelet fingerprint, where a is the scale coordinate and b is the translation coordinate, and let P be a \(2 \times N\) matrix holding the coordinates of all N nonzero pixels in I(a, b), with P(b, i) the b value and P(a, i) the a value of the ith pixel. The first features extracted are the parameters of the ellipse that most closely matches the shape of the fingerprint. To calculate these we use the formula for central moments given by
\[\mu_{pq} = \sum_{x}\sum_{y} (x - \bar{x})^p (y - \bar{y})^q f(x, y), \tag{31}\]
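where \((\bar{x}, \bar{y})\) is the center of the object and f(x, y) is the value of the pixel in the image. Since the image being analyzed is a binary image, Eq. (31) can be simplified to
\[\mu_{pq} = \sum_{i=1}^{N} \left(P(a, i) - c_a\right)^p \left(P(b, i) - c_b\right)^q,\]
where (\(c_a, c_b\)) is the location of the centroid of the wavelet fingerprint as calculated by
\[c_a = \frac{1}{N}\sum_{i=1}^{N} P(a, i), \qquad c_b = \frac{1}{N}\sum_{i=1}^{N} P(b, i).\]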
Using the central moments of the fingerprint, the properties of the ellipse can be found using
where
\(x_\mathrm{maj}\) is the semimajor axis, \(x_\mathrm{min}\) is the semiminor axis, ecc is the eccentricity, and \(\theta \) is the orientation angle of the ellipse. After the ellipse is derived for the fingerprint, degree 2 and degree 4 polynomials are calculated using the polyfit function in the NumPy library in Python (Oliphant 2006). To calculate the polynomial coefficients, the outermost values of the wavelet fingerprint are found. If there are multiple outer b values for a single a value then the lowest value for a is used as the outer point for the polynomial fit. The image on the right side of Fig. 3 shows the ellipse (red) and the degree 2 (blue) and degree 4 (green) polynomials fit to a single object in the fingerprint shown at the bottom left of Fig. 3, namely the object centered near \(b = 475\).
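As an illustration of the ellipse and polynomial features, the sketch below uses the standard "ellipse with the same normalized second central moments" construction, which may differ in detail from the expressions used by the authors. Rows are taken as the scale coordinate a and columns as the translation coordinate b, \(\theta\) is measured from the b axis, and the outer boundary is read as the smallest a index in each b column. The function names and these conventions are assumptions.

```python
import numpy as np


def ellipse_features(image):
    """Ellipse matching the second central moments of a binary object."""
    a_idx, b_idx = np.nonzero(image)           # a: rows (scale), b: columns
    c_a, c_b = a_idx.mean(), b_idx.mean()      # centroid
    da, db = a_idx - c_a, b_idx - c_b
    n = a_idx.size
    mu_aa = (da * da).sum() / n                # normalized central moments
    mu_bb = (db * db).sum() / n
    mu_ab = (da * db).sum() / n
    common = np.sqrt((mu_bb - mu_aa) ** 2 + 4.0 * mu_ab ** 2)
    lam_major = (mu_aa + mu_bb + common) / 2.0  # covariance eigenvalues
    lam_minor = max((mu_aa + mu_bb - common) / 2.0, 0.0)
    ecc = np.sqrt(1.0 - lam_minor / lam_major) if lam_major > 0 else 0.0
    return {
        "c_a": c_a,
        "c_b": c_b,
        "x_maj": 2.0 * np.sqrt(lam_major),     # semimajor axis
        "x_min": 2.0 * np.sqrt(lam_minor),     # semiminor axis
        "ecc": ecc,
        "theta": 0.5 * np.arctan2(2.0 * mu_ab, mu_bb - mu_aa),
    }


def outer_polynomials(image, degrees=(2, 4)):
    """Fit polynomials a(b) to the outer boundary of a binary object.

    Assumption: the outer point of each b column is its smallest a index.
    """
    a_idx, b_idx = np.nonzero(image)
    b_values = np.unique(b_idx)
    outer_a = np.array([a_idx[b_idx == b].min() for b in b_values])
    return {deg: np.polyfit(b_values, outer_a, deg) for deg in degrees}
```

For a filled disc of radius r this construction returns semi-axes close to r and an eccentricity near zero, which is a quick sanity check on the scaling.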
Features based on the area of the fingerprint are also calculated. The first is the area of the fingerprint, A, which is simply the number of nonzero pixels in the fingerprint image. The second is the area of the bounding box, \(A_{BB}\), the box that completely surrounds the wavelet fingerprint in the image space. These measures are used to calculate the ratio of the area of the fingerprint to the area of the bounding box, also known as the extent,
\[\mathrm{Extent} = \frac{A}{A_{BB}}.\]
Filled area is the total number of nonzero pixels in I(a, b) if all the holes inside the fingerprint are redefined to be one. Convex image area, \(A_\mathrm{C}\), is defined as the area of the smallest convex polygon that can contain the fingerprint. This is calculated using the skimage library in Python (van der Walt et al. 2014). Solidity is the ratio of the area of I(a, b) to the area of the convex image,
\[\mathrm{Solidity} = \frac{A}{A_\mathrm{C}}.\]
A topological feature called the Euler number is also added. The Euler number is the difference between the number of objects and the number of holes in an image (Pratt 2013). It is calculated as
where \(n\{Q_i\}\) represents the number of bit quads, or \(2\times 2\) segments of the fingerprint I(a, b), that have i nonzero entries. \(Q_D\) is a special type of \(Q_2\) bit quad in which the nonzero entries lie on either diagonal.
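The bit-quad count can be sketched as follows. The sketch assumes the 8-connected form of the bit-quad formula, E = (n{Q_1} - n{Q_3} - 2 n{Q_D}) / 4, chosen because objects are labeled with 8-connectivity; the 4-connected form uses + 2 n{Q_D} instead. The function name is hypothetical.

```python
import numpy as np


def euler_number_bit_quads(image):
    """Euler number (objects minus holes) from 2x2 bit-quad counts.

    Assumes the 8-connected form: E = (n{Q1} - n{Q3} - 2 n{QD}) / 4.
    """
    # Pad with zeros so quads straddling the image border are counted.
    img = np.pad((np.asarray(image) > 0).astype(int), 1)
    top_left, top_right = img[:-1, :-1], img[:-1, 1:]
    bottom_left, bottom_right = img[1:, :-1], img[1:, 1:]
    total = top_left + top_right + bottom_left + bottom_right

    n_q1 = np.sum(total == 1)
    n_q3 = np.sum(total == 3)
    # Diagonal quads: two nonzero pixels sitting on opposite corners.
    n_qd = np.sum((total == 2) & (top_left == bottom_right))
    return int((n_q1 - n_q3 - 2 * n_qd) // 4)
```

A single pixel gives 1, a closed ring gives 0, and two pixels touching only at a corner give 1, since 8-connectivity treats them as one object.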
Three more features are added onto the feature vector: \(c_a\), the length in time of the object, and the diameter of a circle with the same area as the fingerprint, calculated by
\[d = \sqrt{\frac{4A}{\pi}}.\]
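The area-based features can be sketched with NumPy and SciPy alone. The function and key names are hypothetical, the length in time is taken to be the width of the bounding box along b, and the convex area and solidity are left out here because they need a convex hull routine such as the one in skimage.

```python
import numpy as np
from scipy import ndimage


def area_features(image):
    """Area, bounding box, extent, filled area and equivalent diameter."""
    mask = np.asarray(image) > 0
    a_idx, b_idx = np.nonzero(mask)
    area = int(mask.sum())                           # A: nonzero pixels
    height = int(a_idx.max() - a_idx.min() + 1)      # extent along scale a
    width = int(b_idx.max() - b_idx.min() + 1)       # extent along time b
    bbox_area = height * width                       # A_BB
    filled_area = int(ndimage.binary_fill_holes(mask).sum())
    return {
        "area": area,
        "bbox_area": bbox_area,
        "extent": area / bbox_area,
        "filled_area": filled_area,
        "length_in_time": width,
        "equivalent_diameter": float(np.sqrt(4.0 * area / np.pi)),
    }
```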
This gives a total feature vector of length 21 for each fingerprint.
Lastly we want to create a set of features describing the gradient of the object. To do this we use Histograms of Oriented Gradients (HOG) (Dalal and Triggs 2005). HOG calculates the gradient of an image at all points and then pools gradients into different bins to create a histogram describing the distribution of gradients over some window. For our application we are using HOG features to describe the shape of an object in a fingerprint I(a, b).
Usually the first step in calculating HOG features is to normalize the image, but since I(a, b) is a binary image, normalization will have no effect. For us, the first step will be to calculate the gradients. A kernel is used in image convolution to calculate the gradient at each point by defining two matrices \(g_x\) and \(g_y\), both of the same shape as I where
If in either case one of the indices goes out of bounds on I, then it is set to the value of the nearest pixel. Then the gradient magnitude and angle can be calculated at each point by
\[G = \sqrt{g_x^2 + g_y^2}, \qquad \theta = \arctan\left(\frac{g_y}{g_x}\right),\]
where, again, G and \(\theta \) are both of the same dimension as I. The final step is to create the histograms. In traditional HOG a window, w, is defined and a weighted histogram is computed for every \(w\times w\) window. However, this requires all images to have the same shape, which is too restrictive for our case. Either we would routinely cut off information from objects by enforcing a limit on T, or we would leave too much empty space in images, adding useless information. So we create one histogram for all pixels in I(a, b). There are two different methods for binning gradients, signed and unsigned. Signed binning places gradient vectors across all angles from 0 to \(2\pi \), while unsigned binning only goes from 0 to \(\pi \), and antiparallel gradient vectors are placed in the same bin (e.g., a gradient of \(\pi /2\) is binned with \(-\,\pi /2\)). Due to the nature of binary images, in the unsigned case there are only four possible angles: 0, \(\pi /4\), \(\pi /2\), and \(3\pi /4\). Thus, we define a four-dimensional HOG feature vector \(\mathbf {h}\) with one entry for each possible angle. For each occurrence of one of these angles, the gradient magnitude G(a, b) at that pixel is added to the corresponding bin in \(\mathbf {h}\). All HOG vectors, \(\mathbf {h}\), were then normalized by the total time, T, of the given object to ensure all HOG feature values were weighted similarly. We then append \(\mathbf {h}\) to the full feature vector for I(a, b).
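A minimal sketch of this single-histogram, unsigned HOG variant. It assumes a centered difference kernel [-1, 0, 1] along each axis (the exact kernel is not stated in the text here), edge replication for out-of-bounds indices, and T taken as the number of columns of the object's image. The function name `hog_four_bin` is hypothetical.

```python
import numpy as np


def hog_four_bin(image):
    """Four-bin unsigned gradient histogram of a binary object image."""
    I = (np.asarray(image) > 0).astype(float)
    # Replicate the nearest pixel wherever an index falls outside the image.
    padded = np.pad(I, 1, mode="edge")
    g_b = padded[1:-1, 2:] - padded[1:-1, :-2]   # difference along time b
    g_a = padded[2:, 1:-1] - padded[:-2, 1:-1]   # difference along scale a

    G = np.hypot(g_a, g_b)                        # gradient magnitude
    # Unsigned angle in [0, pi): antiparallel gradients share a bin.
    theta = np.mod(np.arctan2(g_a, g_b), np.pi)

    h = np.zeros(4)                               # bins 0, pi/4, pi/2, 3pi/4
    active = G > 0
    bin_index = np.rint(theta[active] / (np.pi / 4)).astype(int) % 4
    np.add.at(h, bin_index, G[active])            # magnitude-weighted counts

    T = I.shape[1]                                # assumed length in time
    return h / T
```

Because each difference can only take the values -1, 0, or 1 on a binary image, every nonzero gradient falls exactly on one of the four bin angles.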
A feature vector of dimension m, where m is the total number of features, is derived for every object that has more than 200 nonzero pixels. The value 200 was selected because many small objects represent noise in the data, and below about 200 pixels it was difficult to fit well-defined polynomials to the objects. All feature vectors are then combined into the matrix \(F \in \mathbb {R}^{O \times m}\), where O is the total number of objects. Each vector \(\mathbf {f}_o\) in F is then clustered to find the predominant types of objects created in tweet storms.
Kirn, S.L., Hinders, M.K. Dynamic wavelet fingerprint for differentiation of tweet storm types. Soc. Netw. Anal. Min. 10, 4 (2020). https://doi.org/10.1007/s13278-019-0617-3
DOI: https://doi.org/10.1007/s13278-019-0617-3