Abstract
Detecting topics from Twitter streams has become an important task, with applications in fields including natural disaster warning, user opinion assessment, and traffic prediction. In this article, we outline different types of topic detection techniques and evaluate their performance. We group the techniques into five categories: clustering, frequent pattern mining, exemplar-based, matrix factorization, and probabilistic models. For clustering, we discuss and evaluate nine techniques: sequential k-means, spherical k-means, kernel k-means, scalable kernel k-means, incremental batch k-means, DBSCAN, spectral clustering, document pivot clustering, and Bngram. For matrix factorization, we analyze five techniques: sequential Latent Semantic Indexing (LSI), stochastic LSI, Alternating Least Squares (ALS), Rank-one Downdate (R1D), and Column Subset Selection (CSS). We also evaluate several other techniques in the frequent pattern mining, exemplar-based, and probabilistic model categories. Results on three Twitter datasets show that Soft Frequent Pattern Mining (SFM) and Bngram achieve the best term precision, while CSS achieves the best term recall and topic recall in most cases. Exemplar-based topic detection strikes a good balance between term recall and term precision while also achieving good topic recall and running time.
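The abstract names three evaluation measures (term precision, term recall, and topic recall) without defining them. The sketch below shows one plausible way such measures could be computed. It assumes that each topic is represented as a set of terms, that term-level scores are pooled over all topics, and that a ground-truth topic counts as detected when a single detected topic contains all of its terms. The function names and the matching rule are illustrative assumptions, not the definitions used in the article.

```python
def term_precision_recall(detected_topics, truth_topics):
    """Pooled term precision and recall (assumed definition, for illustration).

    Both arguments are lists of sets of terms.
    """
    detected_terms = set().union(*detected_topics)
    truth_terms = set().union(*truth_topics)
    correct = detected_terms & truth_terms
    precision = len(correct) / len(detected_terms) if detected_terms else 0.0
    recall = len(correct) / len(truth_terms) if truth_terms else 0.0
    return precision, recall


def topic_recall(detected_topics, truth_topics):
    """Fraction of ground-truth topics fully covered by some detected topic.

    The 'all terms must appear in one detected topic' rule is an assumption.
    """
    if not truth_topics:
        return 0.0
    hits = sum(
        1
        for truth in truth_topics
        if any(truth <= detected for detected in detected_topics)
    )
    return hits / len(truth_topics)


# Hypothetical toy data, not taken from the article's datasets.
detected = [{"earthquake", "haiti", "relief"}, {"election", "vote", "tv"}]
truth = [{"earthquake", "haiti"}, {"election", "poll"}]

print(term_precision_recall(detected, truth))
print(topic_recall(detected, truth))
```

Under these assumptions, a method that emits many terms per topic tends to trade term precision for term recall, which is the kind of balance the abstract refers to when comparing SFM, Bngram, CSS, and the exemplar-based approach.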
Notes
http://www.cs.princeton.edu/courses/archive/fall08/cos436/Duda/C/sk_means.htm. [last visited February 24, 2016].
Figure from J. Steinberger and K. Ježek. “Using latent semantic analysis in text summarization and summary evaluation.” In Proc. ISIM, 2004.
Acknowledgements
This publication was made possible by a grant from the Qatar National Research Fund through National Priority Research Program (NPRP) No. 06-1220-1-233. Its contents are solely the responsibility of the authors.
Author information
Additional information
Rania Ibrahim and Ahmed Elbagoury are co-first authors.
About this article
Cite this article
Ibrahim, R., Elbagoury, A., Kamel, M.S. et al. Tools and approaches for topic detection from Twitter streams: survey. Knowl Inf Syst 54, 511–539 (2018). https://doi.org/10.1007/s10115-017-1081-x
DOI: https://doi.org/10.1007/s10115-017-1081-x