Assessing the Trustworthiness of Large Language Models on Domain-Specific Questions

Mitrović, Sandra; Mazzola, Matteo; Larcher, Roberto; Guzzi, Jérôme

doi:10.1007/978-3-031-73503-5_25

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14969)

Included in the following conference series:

EPIA Conference on Artificial Intelligence

Abstract

Using prompt engineering and retrieval-augmented generation, we can leverage pre-trained Large Language Models to answer domain-specific questions relying on information from textual sources. In this work, we discuss how to assess the trustworthiness of a module that performs such a task: how to build a large, representative, and unbiased dataset of questions and answers by automatically generating variations, and which metrics to compute. We apply the methodology to a use case in which a smart wheelchair provides answers about its functioning, presenting experimental results on a dataset of more than 1000 questions.
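As a rough illustration of the kind of evaluation the abstract describes, the following minimal sketch generates question variations and computes two simple metrics over the answers. It is not the authors' method: the synonym-swap variation scheme, the `accuracy` and `consistency` definitions, and the `toy_module` stand-in for the question-answering module are all assumptions made for illustration only.

```python
from collections import Counter


def make_variations(question, synonyms):
    """Return the question plus variants with one word swapped for a synonym.

    `synonyms` is a hypothetical hand-written mapping; a real pipeline
    might use a paraphrasing model instead.
    """
    words = question.split()
    variations = [question]
    for i, word in enumerate(words):
        for alt in synonyms.get(word.lower(), []):
            variations.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return variations


def accuracy(answers, expected):
    """Fraction of answers that exactly match the expected answer."""
    return sum(a == expected for a in answers) / len(answers)


def consistency(answers):
    """Fraction of answers that agree with the most frequent answer."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def toy_module(question):
    """Hypothetical stand-in for the LLM-based module under test."""
    return "press the red button" if "stop" in question.lower() else "unknown"


# Small demo: one seed question, two single-word substitutions.
variants = make_variations(
    "how do I stop the wheelchair",
    {"stop": ["halt"], "wheelchair": ["chair"]},
)
answers = [toy_module(v) for v in variants]
print("accuracy:", accuracy(answers, "press the red button"))
print("consistency:", consistency(answers))
```

Exact string matching is used here only to keep the sketch short; comparing free-text answers would generally need a softer notion of equivalence.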

Acknowledgments

This work was supported in part by the REXASI-PRO H-EU project, call HORIZON-CL4-2021-HUMAN-01-01, Grant agreement no. 101070028. (Funded by the European Union. Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.)

Author information

Authors and Affiliations

IDSIA, USI-SUPSI, Lugano, Switzerland
Sandra Mitrović & Jérôme Guzzi
Spindox Labs, Povo, Italy
Matteo Mazzola & Roberto Larcher

Corresponding author

Correspondence to Sandra Mitrović.

Editor information

Editors and Affiliations

University of Minho, Braga, Portugal
Manuel Filipe Santos
University of Minho, Braga, Portugal
José Machado
University of Minho, Braga, Portugal
Paulo Novais
University of Minho, Braga, Portugal
Paulo Cortez
Polytechnic Institute of Viana do Castelo, Viana do Castelo, Portugal
Pedro Miguel Moreira

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

About this paper

Cite this paper

Mitrović, S., Mazzola, M., Larcher, R., Guzzi, J. (2025). Assessing the Trustworthiness of Large Language Models on Domain-Specific Questions. In: Santos, M.F., Machado, J., Novais, P., Cortez, P., Moreira, P.M. (eds) Progress in Artificial Intelligence. EPIA 2024. Lecture Notes in Computer Science, vol 14969. Springer, Cham. https://doi.org/10.1007/978-3-031-73503-5_25

DOI: https://doi.org/10.1007/978-3-031-73503-5_25
Published: 16 November 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73502-8
Online ISBN: 978-3-031-73503-5
eBook Packages: Computer Science, Computer Science (R0)
