Large Language Models for Scientific Question Answering: An Extensive Analysis of the SciQA Benchmark

Lehmann, Jens; Meloni, Antonello; Motta, Enrico; Osborne, Francesco; Recupero, Diego Reforgiato; Salatino, Angelo Antonio; Vahdati, Sahar

doi:10.1007/978-3-031-60626-7_11

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14664)

Included in the following conference series:

European Semantic Web Conference

Abstract

The SciQA benchmark for scientific question answering aims to represent a challenging task for next-generation question-answering systems, one on which vanilla large language models fail. In this article, we analyse the performance of language models on this benchmark, including the prompting and fine-tuning techniques used to adapt them to the SciQA task. We show that both fine-tuning and prompting with intelligent few-shot selection achieve excellent results on the SciQA benchmark. We discuss the lessons learned and the common error categories, and outline their implications for optimising large language models for question answering over knowledge graphs.

Notes

1. Open Research Knowledge Graph - https://orkg.org
2. Codebase and prompts - https://github.com/NIMI-research/SciQA-LLM
3. SciQA dataset - https://huggingface.co/datasets/orkg/SciQA
4. Pythia-12b - https://huggingface.co/EleutherAI/pythia-12b
5. Seq2SeqTrainer - https://huggingface.co/docs/transformers/v4.35.1/en/main_classes/trainer#transformers.Seq2SeqTrainer
6. AutoModelForSeq2SeqLM - https://huggingface.co/docs/transformers/v4.35.1/en/model_doc/auto#transformers.AutoModelForSeq2SeqLM
7. Transformers - https://huggingface.co/docs/transformers/v4.35.1/en/index
8. Trainer - https://huggingface.co/docs/transformers/v4.35.1/en/main_classes/trainer
9. Pipeline class - https://huggingface.co/docs/transformers/v4.35.1/en/main_classes/pipelines
10. OpenAI API - https://openai.com/blog/openai-api
11. ChatCompletion - https://platform.openai.com/docs/guides/text-generation/chat-completions-api
12. OpenAI Libraries - https://platform.openai.com/docs/libraries

Author information

Authors and Affiliations

ScaDS.AI - TU Dresden, Dresden, Germany
Jens Lehmann & Sahar Vahdati
Amazon, Munich, Germany
Jens Lehmann
Department of Mathematics and Computer Science, University of Cagliari, Cagliari, Italy
Antonello Meloni & Diego Reforgiato Recupero
Knowledge Media Institute, The Open University, Milton Keynes, UK
Enrico Motta, Francesco Osborne & Angelo Antonio Salatino
Department of Business and Law, University of Milano-Bicocca, Milan, Italy
Francesco Osborne

Corresponding author

Correspondence to Jens Lehmann.

Editor information

Editors and Affiliations

King’s College London, London, UK
Albert Meroño Peñuela
KU Leuven, Sint-Katelijne-Waver, Belgium
Anastasia Dimou
EURECOM, Biot, France
Raphaël Troncy
Linköping University, Linköping, Sweden
Olaf Hartig
Technical University of Munich, Heilbronn, Germany
Maribel Acosta
Polytechnic Institute of Paris, Palaiseau, France
Mehwish Alam
University of Mannheim, Mannheim, Germany
Heiko Paulheim
EURECOM, Biot, France
Pasquale Lisena

Cite this paper

Lehmann, J. et al. (2024). Large Language Models for Scientific Question Answering: An Extensive Analysis of the SciQA Benchmark. In: Meroño Peñuela, A., et al. The Semantic Web. ESWC 2024. Lecture Notes in Computer Science, vol 14664. Springer, Cham. https://doi.org/10.1007/978-3-031-60626-7_11

DOI: https://doi.org/10.1007/978-3-031-60626-7_11
Published: 19 May 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-60625-0
Online ISBN: 978-3-031-60626-7
eBook Packages: Computer Science, Computer Science (R0)
