DOI: 10.1145/3627508.3638322
short-paper
Open Access

Enhancing Human Annotation: Leveraging Large Language Models and Efficient Batch Processing

Published: 10 March 2024

ABSTRACT

Large language models (LLMs) are capable of assessing document and query characteristics, including relevance, and are now being used for a variety of classification and labeling tasks as well. This study explores how to use LLMs to classify an information need, often represented as a user query. In particular, our goal is to classify the cognitive complexity of the search task for a given “backstory”. Using 180 TREC topics and backstories, we show that GPT-based LLMs agree with human experts as much as other human experts do. We also show that batching and ordering can significantly impact the accuracy of GPT-3.5, but rarely alter the quality of GPT-4 predictions. This study provides insights into the efficacy of large language models for annotation tasks normally completed by humans, and offers recommendations for similar applications.
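To make the batched-annotation idea concrete, here is a minimal sketch, assuming the openai v1 Python client. The classify_batch helper, the Bloom-style label set, and the prompt format are hypothetical illustrations for this page, not the authors' exact protocol.

```python
# A minimal sketch of batched LLM annotation, assuming the openai v1
# Python client (pip install openai). The label set, prompt wording, and
# parsing are illustrative, not the paper's exact protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical cognitive-complexity labels, loosely following a
# Bloom-style taxonomy; the paper's actual scheme may differ.
LABELS = ["Remember", "Understand", "Analyze", "Evaluate"]


def classify_batch(backstories: list[str], model: str = "gpt-4") -> list[str]:
    """Label several backstories with one prompt instead of one call each."""
    numbered = "\n".join(f"{i + 1}. {b}" for i, b in enumerate(backstories))
    prompt = (
        "For each numbered search backstory below, reply with one line of "
        f"the form '<number>: <label>', choosing a label from {LABELS}.\n\n"
        f"{numbered}"
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce run-to-run variation in labels
        messages=[{"role": "user", "content": prompt}],
    )
    # Optimistic parsing: assumes the model followed the output format.
    lines = resp.choices[0].message.content.strip().splitlines()
    return [line.split(":", 1)[1].strip() for line in lines]
```

Because the abstract reports that batch composition and item order can shift GPT-3.5's predictions, a sensible precaution with a sketch like this is to re-run each batch under a few random permutations and check that the labels agree before accepting them.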


Published in

CHIIR '24: Proceedings of the 2024 Conference on Human Information Interaction and Retrieval
March 2024, 481 pages
ISBN: 9798400704345
DOI: 10.1145/3627508

          Copyright © 2024 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


          Qualifiers

          • short-paper
          • Research
          • Refereed limited

          Acceptance Rates

Overall acceptance rate: 55 of 163 submissions (34%)