Applying Pairwise Combinatorial Testing to Large Language Model Testing

  • Conference paper
Testing Software and Systems (ICTSS 2023)

Abstract

In this paper, we report on applying combinatorial testing to the testing of large language models (LLMs). Our aim is to pioneer the use of combinatorial testing in the realm of LLMs, e.g. for the generation of additional training or test data. We first describe how to create an input parameter model for the input of an LLM. Based on a given original sentence, we derive new sentences by replacing words with synonyms according to a combinatorial test set, which achieves a specified level of coverage over synonyms while diversifying the input efficiently. Assuming that the semantics of the original sentence are retained in the derived sentences, we construct a test oracle based on existing annotations. In an experimental evaluation, we run pairwise sentence test sets generated from the BoolQ benchmark set [4] against two LLMs (T5 [12] and LLaMa [15]). We automated test sentence generation as well as test execution and analysis, and our experimental evaluation demonstrates the applicability of pairwise combinatorial testing methods to LLMs.
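The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the example sentence, the synonym table, the greedy generator, and the stub `toy_model` are all assumptions made for illustration. Each word becomes one parameter whose values are the word and its synonyms, a greedy loop selects rows until every pair of values is covered, and the oracle expects every derived sentence to receive the annotated answer of the original.

```python
from itertools import combinations, product

def build_ipm(sentence, synonyms):
    """Input parameter model: one parameter per word, values = word + synonyms."""
    return [[w] + synonyms.get(w.lower(), []) for w in sentence.split()]

def pairs_of(row):
    """All (position, value, position, value) pairs covered by one row."""
    return {(i, row[i], j, row[j]) for i, j in combinations(range(len(row)), 2)}

def all_pairs(ipm):
    """Every value pair that a pairwise test set has to cover."""
    return {(i, a, j, b)
            for i, j in combinations(range(len(ipm)), 2)
            for a in ipm[i] for b in ipm[j]}

def pairwise_rows(ipm):
    """Greedy pairwise generation over the full product (small models only)."""
    remaining = all_pairs(ipm)
    candidates = list(product(*ipm))
    rows = []
    while remaining:
        # Pick the candidate that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda r: len(pairs_of(r) & remaining))
        rows.append(best)
        remaining -= pairs_of(best)
    return rows

def oracle_violations(rows, expected, model):
    """Derived sentences whose answer differs from the original annotation."""
    sentences = [" ".join(r) for r in rows]
    return [s for s in sentences if model(s) != expected]

# Hypothetical inputs: sentence, synonyms, and annotation are made up.
synonyms = {"big": ["large"], "car": ["automobile"], "fast": ["quick"]}
ipm = build_ipm("is the big car fast", synonyms)
rows = pairwise_rows(ipm)

# Stub standing in for an LLM; it flips its answer on the word "quick".
def toy_model(sentence):
    return "no" if "quick" in sentence.split() else "yes"

failing = oracle_violations(rows, expected="yes", model=toy_model)
print(len(rows), "derived sentences;", len(failing), "oracle violations")
```

The greedy loop enumerates the full Cartesian product, so it only scales to short sentences with few synonyms; for realistic models a dedicated covering array generator (reference [16] describes one such tool) would be the more appropriate choice. Joining words with spaces also ignores punctuation and inflection, which a real synonym replacement step would have to handle.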

B. Garn, L. Kampel, M. Leithner—Equally contributing first authors.

Notes

  1. https://huggingface.co/docs/transformers/model_doc/t5v1.1, accessed on 2023-05-03.

  2. https://github.com/ggerganov/llama.cpp, accessed on 2023-05-03.

References

  1. Bowman, S.R., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642 (2015)

  2. Božić, J.: Ontology-based metamorphic testing for chatbots. Softw. Qual. J. 30(1), 227–251 (2022)

  3. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901. Curran Associates, Inc. (2020)

  4. Clark, C., Lee, K., Chang, M.W., Kwiatkowski, T., Collins, M., Toutanova, K.: BoolQ: exploring the surprising difficulty of natural yes/no questions. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 2924–2936 (2019)

  5. Gardner, M., Artzi, Y., Basmov, V., Berant, J., Bogin, B., Chen, S., et al.: Evaluating models’ local decision boundaries via contrast sets. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1307–1323 (2020)

  6. Grindal, M., Offutt, J.: Input parameter modeling for combination strategies. In: Proceedings of the 25th Conference on IASTED International Multi-Conference: Software Engineering, pp. 255–260. SE 2007, ACTA Press, Anaheim, CA, USA (2007)

  7. Guichard, J., Ruane, E., Smith, R., Bean, D., Ventresque, A.: Assessing the robustness of conversational agents using paraphrases. In: 2019 IEEE International Conference on Artificial Intelligence Testing (AITest), pp. 55–62 (2019)

  8. Jang, M., Lukasiewicz, T.: Consistency analysis of ChatGPT. arXiv preprint arXiv:2303.06273 (2023). https://doi.org/10.48550/arXiv.2303.06273

  9. Khashabi, D., Khot, T., Sabharwal, A.: More bang for your buck: natural perturbation for robust question answering. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 163–170 (2020)

  10. Kuhn, D., Kacker, R., Lei, Y.: Introduction to Combinatorial Testing. Chapman & Hall/CRC Innovations in Software Engineering and Software Development Series, Taylor & Francis Group, CRC Press, Boca Raton, Florida (2013)

  11. Nie, C., Leung, H.: A survey of combinatorial testing. ACM Comput. Surv. 43(2), 1–29 (2011). https://doi.org/10.1145/1883612.1883618

  12. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(1), 5485–5551 (2020)

  13. Ruane, E., Faure, T., Smith, R., Bean, D., Carson-Berndsen, J., Ventresque, A.: BoTest: a framework to test the quality of conversational agents using divergent input examples. In: Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion. IUI ’18 Companion, ACM, New York, NY, USA (2018)

  14. Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for deep learning in NLP. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3645–3650 (2019)

  15. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., et al.: LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023). https://doi.org/10.48550/arXiv.2302.13971

  16. Wagner, M., Kleine, K., Simos, D.E., Kuhn, R., Kacker, R.: CAGEN: a fast combinatorial test generation tool with support for constraints and higher-index arrays. In: 2020 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pp. 191–200 (2020)

  17. Wotawa, F.: On the use of available testing methods for verification & validation of AI-based software and systems. In: CEUR Workshop Proceedings 2808 (2021)

Acknowledgements

SBA Research (SBA-K1) is a COMET Center within the COMET – Competence Centers for Excellent Technologies Programme and funded by BMK, BMAW, and the federal state of Vienna. The COMET Programme is managed by FFG. Moreover, this work was performed partly under the following financial assistance award 70NANB21H124 from U.S. Department of Commerce, National Institute of Standards and Technology.

Corresponding author

Correspondence to Ludwig Kampel.

Copyright information

© 2023 IFIP International Federation for Information Processing

About this paper

Cite this paper

Garn, B. et al. (2023). Applying Pairwise Combinatorial Testing to Large Language Model Testing. In: Bonfanti, S., Gargantini, A., Salvaneschi, P. (eds) Testing Software and Systems. ICTSS 2023. Lecture Notes in Computer Science, vol 14131. Springer, Cham. https://doi.org/10.1007/978-3-031-43240-8_16

  • DOI: https://doi.org/10.1007/978-3-031-43240-8_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43239-2

  • Online ISBN: 978-3-031-43240-8

  • eBook Packages: Computer Science, Computer Science (R0)
