research-article

Retrieving Webpages Using Online Discussions

Authors:

ChengXiang Zhai

ICTIR '23: Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval

Pages 159 - 168

https://doi.org/10.1145/3578337.3605139

Published: 09 August 2023

Abstract

Online discussions are a ubiquitous aspect of everyday life. An Internet user who interacts with an online discussion may benefit from seeing hyperlinks to webpages relevant to the discussion because the relevant webpages can provide added context, act as citations for background sources, or condense information so that conversations can proceed seamlessly at a high level. In this paper, we propose and study a new task of retrieving relevant webpages given an online discussion. We frame the task as a novel retrieval problem where we treat a sequence of comments in an online discussion as a query and use such a query to retrieve relevant webpages. We construct a new data set using Reddit, an online discussion forum, to study this new problem. We explore and evaluate multiple representative retrieval methods to examine their effectiveness for solving this new problem. We also propose to leverage the comments that contain hyperlinks as training data to enable supervised learning and further improve retrieval performance. We find that results using modern retrieval methods are promising and that leveraging comments with hyperlinks as training data can further improve performance. We release our data set and code to enable additional research in this direction.
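
The idea of using comments that contain hyperlinks as training data can be made concrete with a small sketch. The code below is an illustration only and is not the authors' released code: the `Comment` structure, the `build_training_pairs` helper, the URL pattern, and the choice to form the query from all earlier comments plus the link-stripped text of the linking comment are assumptions made for this example.

```python
import re
from dataclasses import dataclass

# Simple URL pattern; an assumption for illustration, not the paper's extraction rule.
URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")


@dataclass
class Comment:
    author: str
    text: str


def build_training_pairs(thread):
    """Turn a sequence of comments into (query, url) pairs.

    For every comment that contains a hyperlink, the query is the text of all
    earlier comments plus the linking comment with its URLs removed, and the
    target is the linked URL.
    """
    pairs = []
    for i, comment in enumerate(thread):
        urls = URL_PATTERN.findall(comment.text)
        if not urls:
            continue
        # Remove URLs (and collapse whitespace) so the query does not leak the target.
        stripped = " ".join(URL_PATTERN.sub(" ", comment.text).split())
        context = [c.text for c in thread[:i]]
        if stripped:
            context.append(stripped)
        query = " ".join(context)
        for url in urls:
            pairs.append((query, url))
    return pairs


thread = [
    Comment("a", "Is BM25 still a strong baseline?"),
    Comment("b", "Yes, see https://example.org/bm25 for details."),
    Comment("c", "Thanks!"),
]
pairs = build_training_pairs(thread)
for query, url in pairs:
    print(url, "<-", query)
```

Each resulting pair could then serve as a positive example for a supervised retriever, with the query standing in for the discussion and the URL identifying the relevant webpage.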

Cited By

Samarinas C, Zamani H, Hui Yang G, Wang H, Han S, Hauff C, Zuccon G, Zhang Y (2024). ProCIS: A Benchmark for Proactive Retrieval in Conversations. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 830-840. DOI: 10.1145/3626772.3657869. Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3626772.3657869

Index Terms

Retrieving Webpages Using Online Discussions
1. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Environment-specific retrieval
        Web and social media search

Information & Contributors

Information

Published In

ICTIR '23: Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval

August 2023

300 pages

ISBN:9798400700736

DOI:10.1145/3578337

General Chair: Masaharu Yoshioka (Hokkaido University, Japan)

Program Chairs: Julia Kiseleva (Microsoft Research, USA), Mohammad Aliannejadi (University of Amsterdam, Netherlands)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation
IBM-Illinois Discovery Accelerator Institute

Conference

ICTIR '23

Sponsor:

SIGIR

ICTIR '23: The 2023 ACM SIGIR International Conference on the Theory of Information Retrieval

July 23, 2023

Taipei, Taiwan

Acceptance Rates

ICTIR '23 Paper Acceptance Rate 30 of 73 submissions, 41%;

Overall Acceptance Rate 235 of 527 submissions, 45%
