Abstract
The SciQA benchmark for scientific question answering is designed as a challenging task for next-generation question-answering systems, one on which vanilla large language models fail. In this article, we analyse the performance of language models on this benchmark, covering both prompting and fine-tuning techniques for adapting them to the SciQA task. We show that fine-tuning as well as prompting with intelligent few-shot selection yield excellent results on the SciQA benchmark. We discuss the lessons learned and common error categories, and outline their implications for how to optimise large language models for question answering over knowledge graphs.
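To make the "intelligent few-shot selection" mentioned above concrete, the following is a minimal sketch of one plausible selection step: picking the training questions most similar to the test question by embedding similarity and assembling them into a few-shot prompt for SPARQL generation. The use of the sentence-transformers library, the `all-MiniLM-L6-v2` model, and the `build_prompt`/`train_pairs` names are illustrative assumptions, not the paper's confirmed implementation.

```python
# Hedged sketch: embedding-based few-shot selection for SPARQL generation.
# Model choice, prompt wording, and data layout are assumptions.
from sentence_transformers import SentenceTransformer, util

# Hypothetical pool of (question, SPARQL) pairs drawn from the SciQA training split.
train_pairs = [
    ("Which model achieves the highest accuracy on dataset X?",
     "SELECT ?model WHERE { ... }"),
    ("List the papers that evaluate on benchmark Y.",
     "SELECT ?paper WHERE { ... }"),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
train_embeddings = model.encode([q for q, _ in train_pairs], convert_to_tensor=True)

def build_prompt(test_question: str, k: int = 2) -> str:
    """Select the k most similar training questions and build a few-shot prompt."""
    query_emb = model.encode(test_question, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, train_embeddings, top_k=k)[0]
    shots = [train_pairs[h["corpus_id"]] for h in hits]
    demos = "\n\n".join(f"Question: {q}\nSPARQL: {s}" for q, s in shots)
    return f"{demos}\n\nQuestion: {test_question}\nSPARQL:"

print(build_prompt("Which approach reports the best F1 score on SciQA?"))
```

The resulting prompt would then be sent to a language model (e.g. via the OpenAI API referenced in the notes below); the generated SPARQL is executed against the knowledge graph to produce the answer.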
Notes
- 1. Open Research Knowledge Graph - https://orkg.org.
- 2. Codebase and prompts - https://github.com/NIMI-research/SciQA-LLM.
- 3. SciQA dataset - https://huggingface.co/datasets/orkg/SciQA.
- 4. Pythia-12b - https://huggingface.co/EleutherAI/pythia-12b.
- 5.
- 6.
- 7. Transformers - https://huggingface.co/docs/transformers/v4.35.1/en/index.
- 8.
- 9.
- 10. OpenAI API - https://openai.com/blog/openai-api.
- 11.
- 12. OpenAI Libraries - https://platform.openai.com/docs/libraries
A Appendix - Examples
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lehmann, J. et al. (2024). Large Language Models for Scientific Question Answering: An Extensive Analysis of the SciQA Benchmark. In: Meroño Peñuela, A., et al. The Semantic Web. ESWC 2024. Lecture Notes in Computer Science, vol 14664. Springer, Cham. https://doi.org/10.1007/978-3-031-60626-7_11
DOI: https://doi.org/10.1007/978-3-031-60626-7_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-60625-0
Online ISBN: 978-3-031-60626-7