ABSTRACT
Large language models (LLMs) can assess document and query characteristics, including relevance, and are now used for a range of classification and labeling tasks. This study explores how LLMs can classify an information need, often represented as a user query. In particular, our goal is to classify the cognitive complexity of the search task implied by a given “backstory”. Using 180 TREC topics and backstories, we show that GPT-based LLMs agree with human experts as much as human experts agree with each other. We also show that batching and ordering of the input can significantly affect the accuracy of GPT-3.5, but rarely alter the quality of GPT-4’s predictions. This study provides insights into the efficacy of LLMs for annotation tasks normally completed by humans, and offers recommendations for similar applications.
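The batching and ordering sensitivity described above can be illustrated with a small sketch. Everything here is an assumption for illustration only: the label set (a Bloom-style taxonomy of cognitive complexity levels), the helper name `build_batched_prompts`, and the prompt wording are not taken from the paper.

```python
import random
from typing import List, Optional

# Assumed label set, loosely following the revised Bloom taxonomy of
# cognitive complexity; the paper's actual labels may differ.
COMPLEXITY_LABELS = ["Remember", "Understand", "Apply",
                     "Analyze", "Evaluate", "Create"]


def build_batched_prompts(backstories: List[str],
                          batch_size: int,
                          seed: Optional[int] = None) -> List[str]:
    """Group backstories into batches and render one classification
    prompt per batch.

    Optionally shuffle the input first: the study reports that batch
    composition and ordering can change GPT-3.5's predictions, so
    varying `seed` and `batch_size` is one way to probe that effect.
    """
    items = list(backstories)
    if seed is not None:
        random.Random(seed).shuffle(items)

    prompts = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        numbered = "\n".join(f"{i + 1}. {b}" for i, b in enumerate(batch))
        prompts.append(
            "Classify the cognitive complexity of each search-task "
            f"backstory below, using one label from {COMPLEXITY_LABELS}. "
            "Answer with one label per numbered item.\n\n" + numbered
        )
    return prompts
```

Each rendered prompt would then be sent to a chat-completion endpoint; comparing the labels returned under different `seed` and `batch_size` settings gives a rough measure of how sensitive a given model is to ordering and batching.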
Index Terms
- Enhancing Human Annotation: Leveraging Large Language Models and Efficient Batch Processing