
Tools and approaches for topic detection from Twitter streams: survey

  • Survey Paper
  • Knowledge and Information Systems

Abstract

Detecting topics from Twitter streams has become an important task, with applications in fields including natural disaster warning, user opinion assessment, and traffic prediction. In this article, we outline different types of topic detection techniques and evaluate their performance. We group the techniques into five categories: clustering, frequent pattern mining, exemplar-based, matrix factorization, and probabilistic models. For clustering, we discuss and evaluate nine techniques: sequential k-means, spherical k-means, kernel k-means, scalable kernel k-means, incremental batch k-means, DBSCAN, spectral clustering, document pivot clustering, and Bngram. For matrix factorization, we analyze five techniques: sequential Latent Semantic Indexing (LSI), stochastic LSI, Alternating Least Squares (ALS), Rank-one Downdate (R1D), and Column Subset Selection (CSS). We also evaluate several other techniques in the frequent pattern mining, exemplar-based, and probabilistic model categories. Results on three Twitter datasets show that Soft Frequent Pattern Mining (SFM) and Bngram achieve the best term precision, while CSS achieves the best term recall and topic recall in most cases. Exemplar-based topic detection strikes a good balance between term recall and term precision while also achieving good topic recall and running time.
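
The abstract reports term precision, term recall, and topic recall without defining them. The sketch below shows one plausible way such metrics could be computed; it is an illustration under stated assumptions, not the article's evaluation code. It assumes each topic is a set of terms, that a ground-truth topic counts as detected when some detected topic shares at least `min_overlap` terms with it, and that term precision and term recall are pooled over the matched pairs. The function names `match_topics` and `evaluate` and the `min_overlap` parameter are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def match_topics(detected: List[Set[str]], truth: List[Set[str]],
                 min_overlap: int = 1) -> List[Tuple[int, int]]:
    """Pair each ground-truth topic with the detected topic sharing the most terms.

    Assumption: a ground-truth topic is "detected" if the best-overlapping
    detected topic shares at least `min_overlap` terms with it.
    """
    pairs = []
    for t_idx, t_terms in enumerate(truth):
        best_idx, best_overlap = -1, 0
        for d_idx, d_terms in enumerate(detected):
            overlap = len(t_terms & d_terms)
            if overlap > best_overlap:
                best_idx, best_overlap = d_idx, overlap
        if best_idx >= 0 and best_overlap >= min_overlap:
            pairs.append((t_idx, best_idx))
    return pairs


def evaluate(detected: List[Set[str]], truth: List[Set[str]],
             min_overlap: int = 1) -> Dict[str, float]:
    """Compute topic recall, term precision, and term recall (assumed definitions)."""
    pairs = match_topics(detected, truth, min_overlap)

    # Topic recall: fraction of ground-truth topics that were matched.
    topic_recall = len(pairs) / len(truth) if truth else 0.0

    # Pool term counts over the matched (truth, detected) pairs.
    correct = sum(len(truth[t] & detected[d]) for t, d in pairs)
    detected_terms = sum(len(detected[d]) for _, d in pairs)
    truth_terms = sum(len(truth[t]) for t, _ in pairs)

    term_precision = correct / detected_terms if detected_terms else 0.0
    term_recall = correct / truth_terms if truth_terms else 0.0
    return {
        "topic_recall": topic_recall,
        "term_precision": term_precision,
        "term_recall": term_recall,
    }


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    detected_topics = [{"earthquake", "japan", "tsunami"}, {"football", "goal"}]
    truth_topics = [{"earthquake", "tsunami", "warning", "coast"}, {"election", "vote"}]
    print(evaluate(detected_topics, truth_topics))
```

Under these assumed definitions, a method that emits few, highly specific terms per topic tends to score well on term precision, while a method that emits many terms per topic tends to score well on term recall. That is the trade-off the abstract describes when it contrasts SFM and Bngram with CSS.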




Acknowledgements

This publication was made possible by a grant from the Qatar National Research Fund through National Priority Research Program (NPRP) No. 06-1220-1-233. Its contents are solely the responsibility of the authors.

Author information

Corresponding authors

Correspondence to Rania Ibrahim or Ahmed Elbagoury.

Additional information

Rania Ibrahim and Ahmed Elbagoury are co-first authors.

About this article

Cite this article

Ibrahim, R., Elbagoury, A., Kamel, M.S. et al. Tools and approaches for topic detection from Twitter streams: survey. Knowl Inf Syst 54, 511–539 (2018). https://doi.org/10.1007/s10115-017-1081-x

  • DOI: https://doi.org/10.1007/s10115-017-1081-x
