Large Language Models for Scientific Question Answering: An Extensive Analysis of the SciQA Benchmark

  • Conference paper
  • In: The Semantic Web (ESWC 2024)

Abstract

The SciQA benchmark for scientific question answering aims to represent a challenging task for next-generation question-answering systems, one on which vanilla large language models fail. In this article, we analyse the performance of language models on this benchmark, including prompting and fine-tuning techniques that adapt them to the SciQA task. We show that both fine-tuning and prompting with intelligent few-shot selection yield excellent results on the SciQA benchmark. We discuss the lessons learned and common error categories, and outline their implications for optimising large language models for question answering over knowledge graphs.
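
To make the idea of few-shot selection concrete, the following is a minimal sketch of one possible strategy: choose the training examples whose questions are most similar to the test question, then place them in the prompt. The similarity measure (Jaccard overlap of word tokens), the function names, the prompt wording, and the example questions and queries (including the `ex:` prefix) are all assumptions made for illustration; this section does not specify the selection method actually used in the paper.

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_few_shot(question, pool, k=3):
    """Return the k pool examples most similar to the question.

    Assumption: token overlap stands in for whatever similarity
    measure a real system would use (e.g. embedding similarity).
    """
    q_tokens = tokenize(question)
    ranked = sorted(
        pool,
        key=lambda ex: jaccard(q_tokens, tokenize(ex["question"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, examples):
    """Assemble a few-shot prompt from the selected examples."""
    parts = ["Translate the question into a SPARQL query."]
    for ex in examples:
        parts.append(f"Question: {ex['question']}\nQuery: {ex['query']}")
    parts.append(f"Question: {question}\nQuery:")
    return "\n\n".join(parts)


# Hypothetical question/query pairs; the ex: vocabulary is invented.
pool = [
    {
        "question": "Which model has the highest accuracy on the ImageNet benchmark?",
        "query": "SELECT ?model WHERE { ?model ex:bestAccuracyOn ex:ImageNet }",
    },
    {
        "question": "How many papers use the SQuAD dataset?",
        "query": "SELECT (COUNT(?p) AS ?n) WHERE { ?p ex:usesDataset ex:SQuAD }",
    },
    {
        "question": "List the metrics reported for the CIFAR-10 benchmark.",
        "query": "SELECT ?metric WHERE { ex:CIFAR10 ex:hasMetric ?metric }",
    },
]

question = "Which model has the highest F1 score on the SQuAD benchmark?"
selected = select_few_shot(question, pool, k=2)
prompt = build_prompt(question, selected)
```

The resulting `prompt` string could then be sent to a language model. Random or fixed example selection would use the same `build_prompt` step and differ only in how `selected` is chosen.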

Notes

  1. Open Research Knowledge Graph - https://orkg.org
  2. Codebase and prompts - https://github.com/NIMI-research/SciQA-LLM
  3. SciQA dataset - https://huggingface.co/datasets/orkg/SciQA
  4. Pythia-12b - https://huggingface.co/EleutherAI/pythia-12b
  5. Seq2SeqTrainer - https://huggingface.co/docs/transformers/v4.35.1/en/main_classes/trainer#transformers.Seq2SeqTrainer
  6. AutoModelForSeq2SeqLM - https://huggingface.co/docs/transformers/v4.35.1/en/model_doc/auto#transformers.AutoModelForSeq2SeqLM
  7. Transformers - https://huggingface.co/docs/transformers/v4.35.1/en/index
  8. Trainer - https://huggingface.co/docs/transformers/v4.35.1/en/main_classes/trainer
  9. Pipeline class - https://huggingface.co/docs/transformers/v4.35.1/en/main_classes/pipelines
  10. OpenAI API - https://openai.com/blog/openai-api
  11. ChatCompletion - https://platform.openai.com/docs/guides/text-generation/chat-completions-api
  12. OpenAI Libraries - https://platform.openai.com/docs/libraries

Author information

Corresponding author

Correspondence to Jens Lehmann.

A Appendix - Examples

[Figures a to e: images not available]

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lehmann, J. et al. (2024). Large Language Models for Scientific Question Answering: An Extensive Analysis of the SciQA Benchmark. In: Meroño Peñuela, A., et al. The Semantic Web. ESWC 2024. Lecture Notes in Computer Science, vol 14664. Springer, Cham. https://doi.org/10.1007/978-3-031-60626-7_11

  • DOI: https://doi.org/10.1007/978-3-031-60626-7_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-60625-0

  • Online ISBN: 978-3-031-60626-7

  • eBook Packages: Computer Science, Computer Science (R0)
