Abstract
A daily summary or digest of microblogs lets social media users stay up to date on what happened that day on their favorite topics. Summarizing microblogs is a non-trivial task. This paper presents a summarization system built over the Twitter stream that summarizes a topic for a given duration. Tweet ranking is the primary task in designing a microblog-based summarization system. After ranking, selecting the relevant tweets is the crucial step because of the massive volume of tweets in the Twitter stream. In addition, the system should include novel tweets in the summary or digest. Relevance is typically measured by a similarity score, obtained from a text similarity algorithm, between the user's information need and each tweet: the more similar the tweet, the higher the score. A threshold on this score is therefore needed that minimizes false-positive judgments. In this paper, we propose novel threshold estimation methods that find optimal values for these thresholds, and we evaluate them against thresholds determined via grid search. According to the results, these methods estimate the thresholds with reasonable accuracy. Previous research set these thresholds empirically and heuristically; our work suggests a method that exploits statistical features of the ranked list to estimate them. We use language models to rank the tweets and to select relevant tweets. For any language model, the choice of smoothing technique and its parameters is critical. The results are also compared with the standard probabilistic ranking algorithm BM25. Learning-to-rank strategies are also implemented and show substantial improvement on some of the evaluation metrics. Experiments were performed on standard benchmarks: the TREC Microblog 2015, TREC RTS 2016, and TREC RTS 2017 datasets. Different variants of normalized discounted cumulative gain (nDCG), the official TREC evaluation metric, namely nDCG-1, nDCG-0, and nDCG-p, are used in this study. We also performed a comprehensive failure analysis of our experiments and identified key issues that can be addressed in future work.
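To make the rank, threshold, and novelty steps described in the abstract concrete, the following is a minimal Python sketch, not the system evaluated in the paper. It assumes a query-likelihood language model with Dirichlet smoothing for ranking. The relevance threshold is a placeholder rule (mean plus k standard deviations of the ranked scores) standing in for the paper's estimation methods, which the abstract does not specify. Novelty is approximated with Jaccard overlap, also an assumption. All function names and the parameters `mu`, `k`, and `novelty_cap` are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def dirichlet_score(query, doc_tokens, coll_tf, coll_len, mu=100.0):
    """Log query likelihood of a tweet under Dirichlet smoothing."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in tokenize(query):
        # Assumption: add-one smoothing of the collection model so that
        # terms unseen in the collection never produce log(0).
        p_coll = (coll_tf.get(term, 0) + 1) / (coll_len + len(coll_tf))
        score += math.log((tf[term] + mu * p_coll) / (len(doc_tokens) + mu))
    return score


def jaccard(a, b):
    """Token-set overlap, used here as a stand-in novelty measure."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def summarize(query, tweets, k=0.5, novelty_cap=0.6, mu=100.0):
    """Rank tweets, keep those above the threshold, drop near-duplicates."""
    docs = [tokenize(t) for t in tweets]
    coll_tf = Counter(tok for d in docs for tok in d)
    coll_len = sum(coll_tf.values())
    scored = sorted(
        ((dirichlet_score(query, d, coll_tf, coll_len, mu), i)
         for i, d in enumerate(docs)),
        reverse=True,
    )
    scores = [s for s, _ in scored]
    # Placeholder threshold derived from statistics of the ranked list;
    # NOT the estimation method proposed in the paper.
    threshold = mean(scores) + k * pstdev(scores)
    digest = []
    for s, i in scored:
        if s < threshold:
            break  # scores are sorted, so nothing further can qualify
        if all(jaccard(docs[i], docs[j]) < novelty_cap for j in digest):
            digest.append(i)
    return [tweets[i] for i in digest]
```

Calling `summarize("earthquake relief", tweets)` on a list of tweet strings returns the selected tweets in rank order. Raising `k` makes the digest stricter about relevance, while lowering `novelty_cap` makes it stricter about redundancy.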
Acknowledgements
The authors acknowledge the rigorous internal review by Dr. Parth Mehta, who gave many constructive comments that improved the manuscript.
Cite this article
Modha, S., Majumder, P., Mandl, T. et al. Design and analysis of microblog-based summarization system. Soc. Netw. Anal. Min. 11, 114 (2021). https://doi.org/10.1007/s13278-021-00830-3
DOI: https://doi.org/10.1007/s13278-021-00830-3