ABSTRACT
Online discussions are a ubiquitous aspect of everyday life. An Internet user who interacts with an online discussion may benefit from seeing hyperlinks to webpages relevant to the discussion because the relevant webpages can provide added context, act as citations for background sources, or condense information so that conversations can proceed seamlessly at a high level. In this paper, we propose and study a new task of retrieving relevant webpages given an online discussion. We frame the task as a novel retrieval problem where we treat a sequence of comments in an online discussion as a query and use such a query to retrieve relevant webpages. We construct a new data set using Reddit, an online discussion forum, to study this new problem. We explore and evaluate multiple representative retrieval methods to examine their effectiveness for solving this new problem. We also propose to leverage the comments that contain hyperlinks as training data to enable supervised learning and further improve retrieval performance. We find that results using modern retrieval methods are promising and that leveraging comments with hyperlinks as training data can further improve performance. We release our data set and code to enable additional research in this direction.
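To make the task framing concrete, the following is a minimal sketch of treating a sequence of comments as a single query and ranking candidate webpages against it. It is not the paper's released code. The choice of BM25 as the ranker, the function names `tokenize` and `bm25_rank`, the parameter defaults, and the example URLs are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(comments, pages, k1=1.2, b=0.75):
    """Rank webpages against a discussion using BM25 (an assumed baseline).

    comments: list of comment strings, in discussion order.
    pages: dict mapping a URL to the page text.
    Returns the URLs sorted from most to least relevant.
    """
    if not pages:
        return []

    # The whole comment sequence is concatenated into one query.
    # Terms repeated across comments therefore contribute more to the score.
    query_terms = tokenize(" ".join(comments))

    docs = {url: tokenize(text) for url, text in pages.items()}
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs

    # Document frequency: number of pages containing each term.
    doc_freq = Counter()
    for toks in docs.values():
        doc_freq.update(set(toks))

    scores = {}
    for url, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[url] = score

    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical discussion and candidate pages.
    discussion = [
        "Has anyone tried keeping a sourdough starter?",
        "Yes, feed it flour and water daily.",
    ]
    candidates = {
        "https://example.org/sourdough": "How to feed a sourdough starter with flour and water",
        "https://example.org/bikes": "Choosing a road bike frame and wheels",
    }
    print(bm25_rank(discussion, candidates))
```

Under the supervised setting described in the abstract, comments that already contain hyperlinks would supply (discussion, linked page) pairs for training a learned ranker in place of this lexical scorer.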
Index Terms
- Retrieving Webpages Using Online Discussions