DOI: 10.1145/3627508.3638322
short-paper
Open Access

Enhancing Human Annotation: Leveraging Large Language Models and Efficient Batch Processing

Published: 10 March 2024

ABSTRACT

Large language models (LLMs) are capable of assessing document and query characteristics, including relevance, and are now being used for a variety of classification and labeling tasks as well. This study explores how to use LLMs to classify an information need, often represented as a user query. In particular, our goal is to classify the cognitive complexity of the search task for a given “backstory”. Using 180 TREC topics and backstories, we show that GPT-based LLMs agree with human experts as much as other human experts do. We also show that batching and ordering can significantly impact the accuracy of GPT-3.5, but rarely alter the quality of GPT-4 predictions. This study provides insights into the efficacy of large language models for annotation tasks normally completed by humans, and offers recommendations for similar applications.
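To make the batched-annotation idea concrete, here is a minimal sketch, assuming the openai v1 Python client. The classify_batch helper, the Bloom-style label set, and the prompt format are hypothetical illustrations for this page, not the authors' exact protocol.

```python
# A minimal sketch of batched LLM annotation, assuming the openai v1
# Python client (pip install openai). The label set, prompt wording, and
# parsing are illustrative, not the paper's exact protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical cognitive-complexity labels, loosely following a
# Bloom-style taxonomy; the paper's actual scheme may differ.
LABELS = ["Remember", "Understand", "Analyze", "Evaluate"]


def classify_batch(backstories: list[str], model: str = "gpt-4") -> list[str]:
    """Label several backstories with one prompt instead of one call each."""
    numbered = "\n".join(f"{i + 1}. {b}" for i, b in enumerate(backstories))
    prompt = (
        "For each numbered search backstory below, reply with one line of "
        f"the form '<number>: <label>', choosing a label from {LABELS}.\n\n"
        f"{numbered}"
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce run-to-run variation in labels
        messages=[{"role": "user", "content": prompt}],
    )
    # Optimistic parsing: assumes the model followed the output format.
    lines = resp.choices[0].message.content.strip().splitlines()
    return [line.split(":", 1)[1].strip() for line in lines]
```

Because the abstract reports that batch composition and item order can shift GPT-3.5's predictions, a sensible precaution with a sketch like this is to re-run each batch under a few random permutations and check that the labels agree before accepting them.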


Published in

CHIIR '24: Proceedings of the 2024 Conference on Human Information Interaction and Retrieval
March 2024, 481 pages
ISBN: 9798400704345
DOI: 10.1145/3627508

          Copyright © 2024 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


          Qualifiers

          • short-paper
          • Research
          • Refereed limited

          Acceptance Rates

Overall acceptance rate: 55 of 163 submissions (34%)