Design and analysis of microblog-based summarization system

  • Original Article
Social Network Analysis and Mining

Abstract

A daily summary or digest from microblogs allows social media users to stay up to date on what happened today on their favorite topic. Summarizing microblogs is a non-trivial task. This paper presents a summarization system built over the Twitter stream that summarizes a topic for a given duration. Tweet ranking is the primary task in designing a microblog-based summarization system. After ranking, selecting the relevant tweets is the crucial task for any summarization system because of the massive volume of tweets in the Twitter stream. In addition, the summarization system should include novel tweets in the summary or digest. The measure of relevance is typically the similarity score obtained from a text similarity algorithm, which measures the similarity between the user's information need and each tweet: the more similar the tweet, the higher the score. We therefore need to choose a threshold that minimizes false-positive judgments. In this paper, we propose novel threshold estimation methods to find optimal values for these thresholds and evaluate them against thresholds determined via grid search. According to the results, these methods estimate the thresholds with reasonable accuracy. Previous research has set these thresholds empirically and heuristically, whereas our work suggests a method that exploits statistical features of the ranked list to estimate them. We used language models to rank the tweets and to select relevant tweets. For any language model, the choice of smoothing technique and its parameters is critical. The results are also compared with the standard probabilistic ranking algorithm BM25. Learning-to-rank strategies are also implemented and show substantial improvement on some of the result metrics. Experiments were performed on standard benchmarks: the TREC Microblog 2015, TREC RTS 2016, and TREC RTS 2017 datasets. Different variants of normalized discounted cumulative gain (nDCG), the official TREC evaluation metric, namely nDCG-1, nDCG-0, and nDCG-p, are used in this study. We also performed a comprehensive failure analysis of our experiments and identified key issues for improvement that can be addressed in the future.
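
The pipeline outlined in the abstract (rank tweets with a smoothed language model, keep those above a relevance threshold, then drop near-duplicates) can be illustrated with a minimal sketch. This is not the authors' implementation. The whitespace tokenizer, the choice of Jelinek-Mercer smoothing with `lam=0.7`, the add-one floor on the collection model, the Jaccard overlap used for the novelty check, and all function names and threshold values are assumptions made only for illustration; the paper estimates its thresholds from statistics of the ranked list, which is not reproduced here.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for illustration)."""
    return text.lower().split()


def jm_score(query, tweet, collection_tf, collection_len, lam=0.7):
    """Average per-term query log-likelihood under Jelinek-Mercer smoothing."""
    q_tokens = tokenize(query)
    if not q_tokens:
        return float("-inf")
    tf = Counter(tokenize(tweet))
    tweet_len = sum(tf.values())
    vocab_size = len(collection_tf)
    log_prob = 0.0
    for term in q_tokens:
        p_doc = tf[term] / tweet_len if tweet_len else 0.0
        # Assumed add-one floor so unseen query terms never yield log(0).
        p_coll = (collection_tf[term] + 1) / (collection_len + vocab_size + 1)
        log_prob += math.log(lam * p_doc + (1 - lam) * p_coll)
    return log_prob / len(q_tokens)


def jaccard(a, b):
    """Token-set overlap, used here as a stand-in novelty measure."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def build_digest(query, tweets, rel_threshold, nov_threshold, k=10):
    """Rank tweets, keep those scoring at least rel_threshold, skip redundant ones."""
    collection_tf = Counter(tok for tw in tweets for tok in tokenize(tw))
    collection_len = sum(collection_tf.values())
    scored = [(jm_score(query, tw, collection_tf, collection_len), tw) for tw in tweets]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    digest = []
    for score, tw in scored:
        if score < rel_threshold or len(digest) == k:
            break  # everything further down the ranked list scores lower
        if all(jaccard(tw, kept) < nov_threshold for kept in digest):
            digest.append(tw)
    return digest
```

Calling `build_digest("nepal earthquake", tweets, rel_threshold=-2.0, nov_threshold=0.6)` returns at most `k` tweets in ranked order. Raising `rel_threshold` trades recall for fewer false positives, which is the trade-off the threshold estimation methods in the paper are meant to settle without a grid search.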

Notes

  1. https://www.oberlo.in/blog/twitter-statistics/.

  2. https://backlinko.com/twitter-users.

  3. https://www.omnicoreagency.com/twitter-statistics/.

  4. https://nlp.stanford.edu/ner/.

  5. https://lucene.apache.org/.

  6. https://variety.com/2020/digital/news/chadwick-boseman-twitter-most-retweeted-most-liked-1234847935/.

  7. https://github.com/sjmodha/Real-Time-summarization.

  8. https://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html.

Acknowledgements

The authors acknowledge the rigorous internal review by Dr. Parth Mehta, who gave many constructive comments that improved the manuscript.

Author information

Corresponding author

Correspondence to Sandip Modha.

Cite this article

Modha, S., Majumder, P., Mandl, T. et al. Design and analysis of microblog-based summarization system. Soc. Netw. Anal. Min. 11, 114 (2021). https://doi.org/10.1007/s13278-021-00830-3

  • DOI: https://doi.org/10.1007/s13278-021-00830-3
