ChatGPT Goes Shopping: LLMs Can Predict Relevance in eCommerce Search

Soviero, Beatriz; Kuhn, Daniel; Salle, Alexandre; Moreira, Viviane Pereira

doi:10.1007/978-3-031-56066-8_1

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14611)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

The dependence on human relevance judgments limits the development of information retrieval test collections, which are vital for evaluating retrieval systems. Since their launch, large language models (LLMs) have been applied to automate several human tasks. Recently, LLMs have started being used to provide relevance judgments for document search. In this work, our goal is to assess whether LLMs can replace human annotators in a different setting: product search in eCommerce. We conducted experiments on open and proprietary industrial datasets to measure the ability of LLMs to predict relevance judgments. We found that LLM-generated relevance assessments show strong agreement (~82%) with human annotations, indicating that LLMs have an innate ability to perform relevance judgments in an eCommerce setting. We then went further and tested whether LLMs can generate annotation guidelines. We found that relevance assessments obtained with LLM-generated guidelines are as accurate as those obtained from human instructions. The source code for this work is available at https://github.com/danimtk/chatGPT-goes-shopping.

B. Soviero and D. Kuhn—Work conducted during an internship at VTEX.
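
The abstract reports agreement between LLM and human labels but does not say how that figure was computed. As a minimal sketch, assuming agreement means the fraction of query-product pairs on which the two label sets match, the calculation could look like the code below. The function names and the toy labels are hypothetical and are not taken from the paper or its repository. Cohen's kappa is included only as a common chance-corrected companion measure, not as the authors' metric.

```python
from collections import Counter


def percent_agreement(human, llm):
    """Fraction of items on which the two annotators gave the same label."""
    if len(human) != len(llm):
        raise ValueError("label lists must have the same length")
    if not human:
        raise ValueError("label lists must not be empty")
    matches = sum(h == m for h, m in zip(human, llm))
    return matches / len(human)


def cohen_kappa(human, llm):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    n = len(human)
    observed = percent_agreement(human, llm)
    human_counts = Counter(human)
    llm_counts = Counter(llm)
    labels = set(human_counts) | set(llm_counts)
    # Expected agreement if both annotators labeled independently
    # according to their own label frequencies.
    expected = sum(human_counts[c] * llm_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels for five query-product pairs (illustrative only).
human_labels = ["relevant", "relevant", "irrelevant", "relevant", "irrelevant"]
llm_labels = ["relevant", "irrelevant", "irrelevant", "relevant", "irrelevant"]

agreement = percent_agreement(human_labels, llm_labels)
kappa = cohen_kappa(human_labels, llm_labels)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

Raw agreement is easy to read but can look high when one label dominates, which is why a chance-corrected measure is often reported alongside it.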

Acknowledgments

The authors thank Shervin Malmasi for his helpful comments and suggestions. This work has been financed in part by VTEX BRASIL (EMBRAPII PCEE1911.0140), CAPES Finance Code 001, and CNPq/Brazil.

Author information

Authors and Affiliations

Institute of Informatics, UFRGS, Porto Alegre, Brazil
Beatriz Soviero & Viviane Pereira Moreira
Institute of Education, Science and Technology of Rio Grande do Sul (IFRS), Ibirubá, Brazil
Daniel Kuhn
VTEX, Porto Alegre, Brazil
Alexandre Salle

Corresponding author

Correspondence to Viviane Pereira Moreira.

Editor information

Editors and Affiliations

Georgetown University, Washington, DC, USA
Nazli Goharian
University of Pisa, Pisa, Italy
Nicola Tonellotto
King's College London, London, UK
Yulan He
University College London, London, UK
Aldo Lipani
University of Glasgow, Glasgow, UK
Graham McDonald
University of Glasgow, Glasgow, UK
Craig Macdonald
University of Glasgow, Glasgow, UK
Iadh Ounis

About this paper

Cite this paper

Soviero, B., Kuhn, D., Salle, A., Moreira, V.P. (2024). ChatGPT Goes Shopping: LLMs Can Predict Relevance in eCommerce Search. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14611. Springer, Cham. https://doi.org/10.1007/978-3-031-56066-8_1

DOI: https://doi.org/10.1007/978-3-031-56066-8_1
Published: 15 March 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56065-1
Online ISBN: 978-3-031-56066-8
eBook Packages: Computer Science, Computer Science (R0)
