Abstract
Topic models are regularly used to provide directed exploration and a high-level overview of a corpus of unstructured text. In many cases, it is important to analyze the evolution of topics over a time range. In this work, we present an application of statistical topic modeling and alignment (binned topic models) to group related documents into automatically generated topics and align the topics across a time range. Additionally, we present TopicFlow, an interactive tool to visualize the evolution of these topics. The tool was developed using an iterative design process based on feedback from expert reviewers. We demonstrate the utility of the tool with a detailed analysis of a corpus of data collected over the period of an academic conference, and demonstrate the effectiveness of this visualization for reasoning about large data through a usability study with 18 participants.
Notes
1. This work is an extension of our prior work [13], in which we originally introduced TopicFlow as a Twitter analysis tool. A video demonstrating this work can be found here: https://www.youtube.com/watch?v=qqIlvMOQaOE&feature=youtu.be
2. For TopicFlow, the number of topics is adjustable with a default of 15 to balance granularity and comprehensibility of the resulting topics.
3. For this implementation the LDA algorithm runs for 100 iterations with \(\alpha =0.5\) and \(\beta =0.5\).
4. For example, Twitter-specific stop words include {rt, retweet, etc.} and Spanish stop words include {el, la, tu, etc.}.
5. \(\cos(A, B) = \frac{A\cdot B}{\left\| A \right\| \left\| B \right\| }\).
6. For prototyping and evaluation purposes, the threshold was set between 0.15 and 0.25 depending on the dataset.
7. A prototype of the TopicFlow tool is available for demo here: http://www.cs.umd.edu/~maliks/topicflow/TopicFlow.html.
8. Twitter’s open API and the fact that tweets are rich with metadata, specifically time stamps, make it an appropriate data source for prototyping and testing.
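As an illustration of the cosine similarity in note 5 and the threshold in note 6, the following is a minimal sketch rather than the TopicFlow implementation. It assumes that each topic is represented as a vector of word weights over a shared vocabulary, and that the threshold applies to the cosine similarity between topics in adjacent time bins. The function names `cosine_similarity` and `align_topics` and the default threshold of 0.2 (a value inside the reported 0.15 to 0.25 range) are hypothetical choices made for this example.

```python
import math


def cosine_similarity(a, b):
    """cos(A, B) = (A . B) / (||A|| ||B||) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: an all-zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def align_topics(prev_topics, next_topics, threshold=0.2):
    """Link topics in two adjacent bins (assumed setup, not from the source).

    Returns a list of (i, j, similarity) tuples for every pair whose
    cosine similarity is at least `threshold`.
    """
    links = []
    for i, p in enumerate(prev_topics):
        for j, q in enumerate(next_topics):
            sim = cosine_similarity(p, q)
            if sim >= threshold:
                links.append((i, j, sim))
    return links
```

Under these assumptions, a higher threshold yields fewer links between bins and a lower one yields more, which is consistent with the threshold being tuned per dataset.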
References
Blei DM, Lafferty JD (2006) Dynamic topic models. In: Proceedings of 23rd international conference on machine learning. ACM Press, New York, pp 113–120
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Bostock M (2012) Data driven documents (d3). http://d3js.org
Cui W, Liu S, Tan L, Shi C, Song Y, Gao Z, Qu H, Tong X (2011) TextFlow: towards better understanding of evolving topics in text. IEEE Trans Vis Comput Graph 17(12):2412–2421
Hart S, Staveland L (1988) Development of NASA-TLX (Task Load Index): results of empirical and theoretical research. Hum Mental Workload 1:139–183
Havre S, Hetzler B, Nowell L (2000) ThemeRiver: visualizing theme changes over time. In: Proceedings of IEEE symposium on information visualization, pp 115–123
Hu Y, Boyd-Graber J, Satinoff B, Smith A (2013) Interactive topic modeling. Mach Learn J 95:423–469
Kleinberg J (2003) Bursty and hierarchical structure in streams. Data Min Knowl Discov 7:373–397
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22:49–86
Leskovec J, Backstrom L, Kleinberg J (2009) Meme-tracking and the dynamics of the news cycle. In: Proceedings of 15th ACM SIGKDD international conference on knowledge discovery and data mining, pp 497–506
Lin J (1991) Divergence measures based on the Shannon entropy. IEEE Trans Inf Theory 37(1):145–151
Liu Y, Niculescu-Mizil A, Gryc W (2009) Topic-link LDA: joint models of topic and author community. In: Proceedings of 26th annual international conference on machine learning. ACM Press, New York, pp 665–672
Malik S, Smith A, Hawes T, Dunne C, Papadatos P, Li J, Shneiderman B (2013) TopicFlow: visualizing topic alignment of Twitter data over time. In: The 2013 IEEE/ACM international conference on advances in social networks analysis and mining
Mimno D, McCallum A (2007) Organizing the OCA: learning faceted subjects from a library of digital books. In: Proceedings of the 7th ACM/IEEE-CS joint conference on digital libraries. ACM Press, New York, pp 376–385
Nikulin M (2001) Hazewinkel, Michiel, encyclopaedia of mathematics : an updated and annotated translation of the Soviet. Mathematical encyclopaedia. Reidel Sold and distributed in the U.S.A. and Canada. Kluwer Academic, Boston
O’Brien WL (2012) Preliminary investigation of the use of Sankey diagrams to enhance building performance simulation-supported design. In: Proceedings of 2012 symposium on simulation for architecture and urban design. Society for Computer Simulation International, San Diego, pp 15:1–15:8
Ramage D, Hall D, Nallapati R, Manning CD (2009) Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of 2009 conference on empirical methods in natural language processing, vol 1. Association for Computational Linguistics, New York, pp 248–256
Shuyo N (2011) LDA implementation. https://github.com/shuyo/iir/blob/master/lda/lda.py
Sopan A, Rey P, Butler B, Shneiderman B (2012) Monitoring academic conferences: real-time visualization and retrospective analysis of backchannel conversations. In: ASE international conference on social informatics, pp 63–69
Tan PN, Steinbach M, Kumar V (2005) Introduction to data mining, 1st edn. Addison Wesley, New York
Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Stat Assoc 101:1566–1581
Wang X, McCallum A (2006) Topics over time: a non-Markov continuous-time model of topical trends. In: Proceedings of 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp 424–433
Wilbur WJ, Sirotkin K (1992) The automatic identification of stop words. J Inf Sci 18(1):45–55
Zhai K, Boyd-Graber J, Asadi N, Alkhouja M (2012) Mr. LDA: a flexible large scale topic modeling package using variational inference in MapReduce. In: ACM international conference on world wide web
Acknowledgments
We would like to thank Timothy Hawes, Cody Dunne, Marc Smith, Jimmy Lin, Jordan Boyd-Graber, Catherine Plaisant, Peter David, and Jim Nolan for their input throughout the design and implementation of this project and thoughtful reviews of this paper. Additionally, we would like to thank Jianyu (Leo) Li and Panagis Papadatos for their assistance in designing, developing, and evaluating the initial version of the tool.
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Smith, A., Malik, S., Shneiderman, B. (2015). Visual Analysis of Topical Evolution in Unstructured Text: Design and Evaluation of TopicFlow. In: Kazienko, P., Chawla, N. (eds) Applications of Social Media and Social Network Analysis. Lecture Notes in Social Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-19003-7_9
DOI: https://doi.org/10.1007/978-3-319-19003-7_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19002-0
Online ISBN: 978-3-319-19003-7
eBook Packages: Computer Science (R0)