Applying Pairwise Combinatorial Testing to Large Language Model Testing

Garn, Bernhard; Kampel, Ludwig; Leithner, Manuel; Celic, Berina; Çulha, Ceren; Hiess, Irene; Kieseberg, Klaus; Koelbing, Marlene; Schreiber, Dominik-Philip; Wagner, Michael; Wech, Christoph; Zivanovic, Jovan; Simos, Dimitris E.

doi:10.1007/978-3-031-43240-8_16

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14131)

Included in the following conference series:

IFIP International Conference on Testing Software and Systems

Abstract

In this paper, we report on applying combinatorial testing to the testing of large language models (LLMs). Our aim is to pioneer the use of combinatorial testing in the realm of LLMs, e.g. for generating additional training or test data. We first describe how to create an input parameter model for the input of an LLM. Starting from a given original sentence, we derive new sentences by replacing words with synonyms according to a combinatorial test set, which achieves a specified level of coverage over synonyms while diversifying the input efficiently. Assuming that the derived sentences retain the semantics of the original sentence, we construct a test oracle based on existing annotations. In an experimental evaluation, we apply pairwise test sets of sentences generated from the BoolQ benchmark set [4] to two LLMs (T5 [12] and LLaMa [15]). Having automated the generation of test sentences as well as their execution and analysis, we demonstrate through our experimental evaluation the applicability of pairwise combinatorial testing methods to LLMs.
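The pipeline summarised in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the sentence template, the synonym lists, the greedy selection strategy, and the `model` callable are all assumptions made for illustration, and the paper's own tooling for test set generation is not reproduced here. Each replaceable word becomes a parameter whose values are the original word and its assumed synonyms, a pairwise test set is selected greedily, and the oracle reuses the annotation of the original sentence.

```python
from itertools import combinations, product

# Hypothetical input parameter model: one parameter per replaceable word,
# whose values are the original word followed by assumed synonyms.
TEMPLATE = "is the {0} {1} than the {2}"
IPM = [
    ["car", "automobile", "vehicle"],
    ["faster", "quicker"],
    ["train", "railway"],
]

def all_pairs(ipm):
    """Every (param_i, value_i, param_j, value_j) combination with i < j."""
    return {
        (i, a, j, b)
        for i, j in combinations(range(len(ipm)), 2)
        for a in ipm[i]
        for b in ipm[j]
    }

def pairs_of(row):
    """The value pairs that a single row (one derived sentence) covers."""
    return {(i, row[i], j, row[j]) for i, j in combinations(range(len(row)), 2)}

def pairwise_test_set(ipm):
    """Greedy sketch: repeatedly pick the row covering the most uncovered pairs."""
    remaining = all_pairs(ipm)
    candidates = list(product(*ipm))  # exhaustive list; feasible only for small models
    rows = []
    while remaining:
        best = max(candidates, key=lambda r: len(pairs_of(r) & remaining))
        rows.append(best)
        remaining -= pairs_of(best)
    return rows

def derive_sentences(template, ipm):
    """Fill the template with each row of the pairwise test set."""
    return [template.format(*row) for row in pairwise_test_set(ipm)]

def oracle_failures(sentences, expected, model):
    """Sentences whose model answer differs from the original annotation."""
    return [s for s in sentences if model(s) != expected]
```

For this toy model the exhaustive set would contain 3 · 2 · 2 = 12 sentences, whereas any pairwise set needs at least 3 · 2 = 6. A call such as `oracle_failures(derive_sentences(TEMPLATE, IPM), True, model)` would then list the derived sentences on which a yes/no model disagrees with the annotated answer, under the assumption that the synonym replacements preserve the meaning of the original sentence.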

B. Garn, L. Kampel, M. Leithner—Equally contributing first authors.

Notes

1. https://huggingface.co/docs/transformers/model_doc/t5v1.1, accessed on 2023-05-03.
2. https://github.com/ggerganov/llama.cpp, accessed on 2023-05-03.

References

1. Bowman, S.R., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642 (2015)
2. Božić, J.: Ontology-based metamorphic testing for chatbots. Softw. Qual. J. 30(1), 227–251 (2022)
3. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901. Curran Associates, Inc. (2020)
4. Clark, C., Lee, K., Chang, M.W., Kwiatkowski, T., Collins, M., Toutanova, K.: BoolQ: exploring the surprising difficulty of natural yes/no questions. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 2924–2936 (2019)
5. Gardner, M., Artzi, Y., Basmov, V., Berant, J., Bogin, B., Chen, S., et al.: Evaluating models' local decision boundaries via contrast sets. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1307–1323 (2020)
6. Grindal, M., Offutt, J.: Input parameter modeling for combination strategies. In: Proceedings of the 25th Conference on IASTED International Multi-Conference: Software Engineering, pp. 255–260. SE 2007, ACTA Press, Anaheim, CA, USA (2007)
7. Guichard, J., Ruane, E., Smith, R., Bean, D., Ventresque, A.: Assessing the robustness of conversational agents using paraphrases. In: 2019 IEEE International Conference on Artificial Intelligence Testing (AITest), pp. 55–62 (2019)
8. Jang, M., Lukasiewicz, T.: Consistency analysis of ChatGPT. arXiv preprint arXiv:2303.06273 (2023). https://doi.org/10.48550/arXiv.2303.06273
9. Khashabi, D., Khot, T., Sabharwal, A.: More bang for your buck: natural perturbation for robust question answering. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 163–170 (2020)
10. Kuhn, D., Kacker, R., Lei, Y.: Introduction to Combinatorial Testing. Chapman & Hall/CRC Innovations in Software Engineering and Software Development Series, Taylor & Francis Group, CRC Press, Boca Raton, Florida (2013)
11. Nie, C., Leung, H.: A survey of combinatorial testing. ACM Comput. Surv. 43(2), 1–29 (2011). https://doi.org/10.1145/1883612.1883618
12. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(1), 5485–5551 (2020)
13. Ruane, E., Faure, T., Smith, R., Bean, D., Carson-Berndsen, J., Ventresque, A.: BoTest: a framework to test the quality of conversational agents using divergent input examples. In: Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion. IUI '18 Companion, ACM, New York, NY, USA (2018)
14. Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for deep learning in NLP. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3645–3650 (2019)
15. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., et al.: LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023). https://doi.org/10.48550/arXiv.2302.13971
16. Wagner, M., Kleine, K., Simos, D.E., Kuhn, R., Kacker, R.: CAGEN: a fast combinatorial test generation tool with support for constraints and higher-index arrays. In: 2020 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pp. 191–200 (2020)
17. Wotawa, F.: On the use of available testing methods for verification & validation of AI-based software and systems. In: CEUR Workshop Proceedings 2808 (2021)

Acknowledgements

SBA Research (SBA-K1) is a COMET Center within the COMET – Competence Centers for Excellent Technologies Programme and funded by BMK, BMAW, and the federal state of Vienna. The COMET Programme is managed by FFG. Moreover, this work was performed partly under financial assistance award 70NANB21H124 from the U.S. Department of Commerce, National Institute of Standards and Technology.

Author information

Authors and Affiliations

MATRIS Research Group, SBA Research, 1040, Vienna, Austria
Bernhard Garn, Ludwig Kampel, Manuel Leithner, Berina Celic, Ceren Çulha, Irene Hiess, Klaus Kieseberg, Marlene Koelbing, Dominik-Philip Schreiber, Michael Wagner, Christoph Wech, Jovan Zivanovic & Dimitris E. Simos

Corresponding author

Correspondence to Ludwig Kampel.

Editor information

Editors and Affiliations

University of Bergamo, Dalmine, Italy
Silvia Bonfanti
University of Bergamo, Dalmine, Italy
Angelo Gargantini
Salvaneschi & Partners, Bergamo, Italy
Paolo Salvaneschi

About this paper

Cite this paper

Garn, B. et al. (2023). Applying Pairwise Combinatorial Testing to Large Language Model Testing. In: Bonfanti, S., Gargantini, A., Salvaneschi, P. (eds) Testing Software and Systems. ICTSS 2023. Lecture Notes in Computer Science, vol 14131. Springer, Cham. https://doi.org/10.1007/978-3-031-43240-8_16

DOI: https://doi.org/10.1007/978-3-031-43240-8_16
Published: 19 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43239-2
Online ISBN: 978-3-031-43240-8
eBook Packages: Computer Science, Computer Science (R0)
