Tools and approaches for topic detection from Twitter streams: survey

Ibrahim, Rania; Elbagoury, Ahmed; Kamel, Mohamed S.; Karray, Fakhri

doi:10.1007/s10115-017-1081-x

Survey Paper
Published: 21 July 2017

Knowledge and Information Systems, volume 54, pages 511–539 (2018)

Rania Ibrahim ORCID: orcid.org/0000-0001-5663-8714¹,
Ahmed Elbagoury¹,
Mohamed S. Kamel¹ &
…
Fakhri Karray¹

Abstract

Detecting topics from Twitter streams has become an important task because it is used in fields such as natural disaster warning, user opinion assessment, and traffic prediction. In this article, we outline different types of topic detection techniques and evaluate their performance. We group the techniques into five categories: clustering, frequent pattern mining, exemplar-based, matrix factorization, and probabilistic models. For clustering, we discuss and evaluate nine techniques: sequential k-means, spherical k-means, kernel k-means, scalable kernel k-means, incremental batch k-means, DBSCAN, spectral clustering, document pivot clustering, and Bngram. For matrix factorization, we analyze five techniques: sequential Latent Semantic Indexing (LSI), stochastic LSI, Alternating Least Squares (ALS), Rank-one Downdate (R1D), and Column Subset Selection (CSS). We also evaluate several other techniques in the frequent pattern mining, exemplar-based, and probabilistic model categories. Results on three Twitter datasets show that Soft Frequent Pattern Mining (SFM) and Bngram achieve the best term precision, while CSS achieves the best term recall and topic recall in most cases. Exemplar-based topic detection strikes a good balance between term recall and term precision while also achieving good topic recall and running time.
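
The abstract compares methods by term precision, term recall, and topic recall but does not define them. The following is a minimal sketch of how such scores could be computed, assuming the common set-overlap definitions: precision is the share of detected terms found in the ground truth, recall is the share of ground-truth terms that were detected, and a ground-truth topic counts as found when some detected topic shares enough terms with it. The function names and the `min_overlap` matching rule are hypothetical and are not taken from the article.

```python
from typing import Iterable, Sequence


def term_precision(detected: Iterable[str], truth: Iterable[str]) -> float:
    """Share of detected terms that also occur in the ground-truth terms."""
    detected_set, truth_set = set(detected), set(truth)
    if not detected_set:
        return 0.0
    return len(detected_set & truth_set) / len(detected_set)


def term_recall(detected: Iterable[str], truth: Iterable[str]) -> float:
    """Share of ground-truth terms that were detected."""
    detected_set, truth_set = set(detected), set(truth)
    if not truth_set:
        return 0.0
    return len(detected_set & truth_set) / len(truth_set)


def topic_recall(
    detected_topics: Sequence[Iterable[str]],
    truth_topics: Sequence[Iterable[str]],
    min_overlap: int = 1,  # hypothetical matching rule, not from the article
) -> float:
    """Share of ground-truth topics matched by at least one detected topic."""
    truth_sets = [set(t) for t in truth_topics]
    detected_sets = [set(d) for d in detected_topics]
    if not truth_sets:
        return 0.0
    matched = sum(
        1
        for t in truth_sets
        if any(len(t & d) >= min_overlap for d in detected_sets)
    )
    return matched / len(truth_sets)


if __name__ == "__main__":
    detected = ["earthquake", "japan", "tsunami", "video"]
    truth = ["earthquake", "japan", "tsunami", "warning", "coast"]
    print(term_precision(detected, truth))  # 3 of 4 detected terms are correct
    print(term_recall(detected, truth))     # 3 of 5 ground-truth terms are found
```

Under these assumed definitions, a method that emits only a few high-confidence terms per topic tends to score well on precision, while a method that emits many terms tends to score well on recall, which is why the abstract reports the two separately.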

Notes

http://www.statisticbrain.com/twitter-statistics/.
http://sherpablog.marketingsherpa.com/social-networking-evangelism-community/twitter-data-mining/.
http://www.cs.princeton.edu/courses/archive/fall08/cos436/Duda/C/sk_means.htm. [last visited February 24, 2016].
Figure from J. Steinberger and K. Ježek. “Using latent semantic analysis in text summarization and summary evaluation.” In Proc. ISIM, 2004.
http://scgroup20.ceid.upatras.gr:8000/tmg/.
http://mathworks.com/matlabcentral/fileexchange/26182-kernel-k-means/content/knkmeans.m.
http://alumni.cs.ucsb.edu/~wychen/sc.html.
http://www.csse.uwa.edu.au/~pk/research/matlabfns/Misc/dbscan.m.
http://www.socialsensor.eu/results/software/87-topic-detection-framework.

Acknowledgements

This publication was made possible by a grant from the Qatar National Research Fund through National Priority Research Program (NPRP) No. 06-1220-1-233. Its contents are solely the responsibility of the authors.

Author information

Authors and Affiliations

University of Waterloo, Waterloo, ON, N2L 3G1, Canada
Rania Ibrahim, Ahmed Elbagoury, Mohamed S. Kamel & Fakhri Karray

Corresponding authors

Correspondence to Rania Ibrahim or Ahmed Elbagoury.

Additional information

Rania Ibrahim and Ahmed Elbagoury are co-first authors.

About this article

Cite this article

Ibrahim, R., Elbagoury, A., Kamel, M.S. et al. Tools and approaches for topic detection from Twitter streams: survey. Knowl Inf Syst 54, 511–539 (2018). https://doi.org/10.1007/s10115-017-1081-x

Received: 25 October 2016
Revised: 29 May 2017
Accepted: 30 June 2017
Published: 21 July 2017
Issue Date: March 2018
DOI: https://doi.org/10.1007/s10115-017-1081-x
