Leveraging Large Language Models as Simulated Users for Initial, Low-Cost Evaluations of Designed Conversations

de Wit, Jan

doi:10.1007/978-3-031-54975-5_5

Jan de Wit (ORCID: orcid.org/0000-0002-7299-7992)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14524)

Included in the following conference series:

International Workshop on Chatbot Research and Design

Abstract

In this paper, we explore the use of large language models, in this case the ChatGPT API, as simulated users to evaluate designed, rule-based conversations. This type of evaluation can be introduced as a low-cost method to identify common usability issues prior to testing conversational agents with actual users. Preliminary findings show that ChatGPT is good at playing the part of a user, providing realistic testing scenarios for designed conversations even if these involve certain background knowledge or context. GPT-4 shows vast improvements over ChatGPT (3.5). In future work, it is important to evaluate the performance of simulated users in a more structured, generalizable manner, for example by comparing their behavior to that of actual users. In addition, ways to fine-tune the LLM could be explored to improve its performance, and the output of simulated conversations could be analyzed to automatically derive usability metrics such as the number of turns needed to reach the goal. Finally, the use of simulated users with open-ended conversational agents could be explored, where the LLM may also be able to reflect on the user experience of the conversation.
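As a rough illustration of the turn-count metric mentioned in the abstract, the sketch below runs a simulated user through a tiny rule-based conversation and reports how many user turns were needed to reach the goal. It is not the paper's implementation: `SCRIPT`, `run_simulation`, and `make_scripted_user` are hypothetical names, and the reply function is a canned stand-in where a call to an LLM would go.

```python
from typing import Callable, Dict, List, Optional, Tuple

Transcript = List[Tuple[str, str]]
ReplyFn = Callable[[str, Transcript], str]

# Hypothetical rule-based script: each state has a bot prompt and maps an
# accepted user answer to the next state. None marks the conversation goal.
SCRIPT: Dict[str, Tuple[str, Dict[str, Optional[str]]]] = {
    "start": ("Do you want to book a table? (yes/no)", {"yes": "size", "no": None}),
    "size": ("For how many people? (2/4)", {"2": None, "4": None}),
}


def run_simulation(reply_fn: ReplyFn, max_turns: int = 10) -> dict:
    """Run the script against a simulated user and count user turns."""
    transcript: Transcript = []
    state: Optional[str] = "start"
    user_turns = 0
    while state is not None and user_turns < max_turns:
        prompt, transitions = SCRIPT[state]
        transcript.append(("bot", prompt))
        answer = reply_fn(prompt, transcript).strip().lower()
        transcript.append(("user", answer))
        user_turns += 1
        if answer in transitions:
            state = transitions[answer]
        # Unrecognized answers leave the state unchanged, so the bot
        # repeats its question on the next iteration.
    return {
        "turns": user_turns,
        "reached_goal": state is None,
        "transcript": transcript,
    }


def make_scripted_user(replies: List[str]) -> ReplyFn:
    """Stand-in for an LLM-backed user: returns canned replies in order."""
    remaining = iter(replies)

    def reply(prompt: str, transcript: Transcript) -> str:
        return next(remaining)

    return reply


result = run_simulation(make_scripted_user(["maybe", "Yes", "4"]))
print(result["turns"], result["reached_goal"])  # 3 True
```

Replacing `make_scripted_user` with a function that sends the prompt and transcript to an LLM would keep the loop unchanged. A high turn count, or a run that hits `max_turns` without reaching the goal, may point to a prompt that simulated users tend to misread.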

Notes

1. See, for example, Botanalytics (https://botanalytics.co).
2. https://github.com/tilbotio/guesswho.
3. https://github.com/tilbotio/tilbot-main.
4. GPT-4 appears to be much better than GPT-3.5 at following multiple choice instructions: It chooses an answer option and returns only that answer option. GPT-3.5 sometimes adds text, or rephrases the answer option.
5. For a discussion of the value of this type of autoethnographic work, please see [49].
6. https://osf.io/8p4zn/.
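Note 4 observes that a model may add text around the chosen answer option instead of returning it alone. One possible way to cope with that in a rule-based flow is to map the free-text reply back onto the known options, as in the minimal sketch below. The helper `match_option` and its matching rules are assumptions made for illustration, not part of Tilbot or of the paper's method, and the sketch cannot recover answers that were rephrased entirely.

```python
import re
from typing import List, Optional


def _normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def match_option(reply: str, options: List[str]) -> Optional[str]:
    """Map a free-text reply onto one of the answer options.

    Returns the option on an exact (normalized) match, or when exactly one
    option appears as a whole phrase inside the reply. Returns None when the
    reply is ambiguous or matches no option.
    """
    norm_reply = _normalize(reply)
    normalized = {_normalize(option): option for option in options}
    if norm_reply in normalized:
        return normalized[norm_reply]
    # Pad with spaces so that "no" does not match inside "know".
    padded = f" {norm_reply} "
    hits = [orig for norm, orig in normalized.items() if f" {norm} " in padded]
    return hits[0] if len(hits) == 1 else None


options = ["Yes", "No"]
print(match_option("I would say No.", options))      # No
print(match_option("Yes or no, hard to say", options))  # None
```

Returning `None` for ambiguous replies lets the conversation repeat the question rather than guess, which keeps any misreadings visible in the simulated transcript.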

References

Afzali, J., Drzewiecki, A.M., Balog, K., Zhang, S.: UserSimCRS: a user simulation toolkit for evaluating conversational recommender systems. In: Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining (WSDM 2023), pp. 1160–1163. Association for Computing Machinery, New York (2023). https://doi.org/10.1145/3539597.3573029
Akbar, S., Coiera, E., Magrabi, F.: Safety concerns with consumer-facing mobile health applications and their consequences: a scoping review. J. Am. Med. Inform. Assoc. 27(2), 330–340 (2019). https://doi.org/10.1093/jamia/ocz175
Allouch, M., Azaria, A., Azoulay, R.: Conversational agents: goals, technologies, vision and challenges. Sensors 21(24), 8448 (2021). https://doi.org/10.3390/s21248448
Argyle, L.P., Busby, E.C., Fulda, N., Gubler, J.R., Rytting, C., Wingate, D.: Out of one, many: using language models to simulate human samples. Polit. Anal. 31(3), 337–351 (2023). https://doi.org/10.1017/pan.2023.2
Bell, G., Blythe, M., Sengers, P.: Making by making strange: defamiliarization and the design of domestic technologies. ACM Trans. Comput. Hum. Interact. 12(2), 149–173 (2005). https://doi.org/10.1145/1067860.1067862
Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S.: On the dangers of stochastic parrots: can language models be too big? In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT 2021), pp. 610–623. Association for Computing Machinery, New York (2021). https://doi.org/10.1145/3442188.3445922
Blythe, M., Buie, E.: Chatbots of the gods: imaginary abstracts for techno-spirituality research. In: Proceedings of the 8th Nordic Conference on Human-Computer Interaction: Fun, Fast, Foundational (NordiCHI 2014), pp. 227–236. Association for Computing Machinery, New York (2014). https://doi.org/10.1145/2639189.2641212
Bozic, J., Tazl, O.A., Wotawa, F.: Chatbot testing using AI planning. In: 2019 IEEE International Conference On Artificial Intelligence Testing (AITest), pp. 37–44 (2019). https://doi.org/10.1109/AITest.2019.00-10
Bravo-Santos, S., Guerra, E., de Lara, J.: Testing chatbots with charm. In: Shepperd, M., Brito e Abreu, F., Rodrigues da Silva, A., Pérez-Castillo, R. (eds.) Quality of Information and Communications Technology: 13th International Conference, QUATIC 2020, Faro, Portugal, September 9–11, 2020, Proceedings, pp. 426–438. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58793-2_34
Cameron, G., et al.: Back to the future: lessons from knowledge engineering methodologies for chatbot design and development. In: British HCI Conference 2018. BCS Learning & Development Ltd. (2018)
Choi, Y., Monserrat, T.J.K.P., Park, J., Shin, H., Lee, N., Kim, J.: ProtoChat: supporting the conversation design process with crowd feedback. Proc. ACM Hum. Comput. Interact. 4(CSCW3), 1–27 (2021). https://doi.org/10.1145/3432924
Cockton, G., Woolrych, A.: Sale must end: Should discount methods be cleared off HCI’s shelves? Interactions 9(5), 13–18 (2002). https://doi.org/10.1145/566981.566990
Cowan, B.R., Clark, L., Candello, H., Tsai, J.: Introduction to this special issue: guiding the conversation: new theory and design perspectives for conversational user interfaces. Hum. Comput. Interact. 38(3–4), 159–167 (2023). https://doi.org/10.1080/07370024.2022.2161905
Dall’Acqua, A., Tamburini, F.: Toward a linguistically grounded dialog model for chatbot design. Italian J. Comput. Linguist. 7(7–1, 2), 191–222 (2021)
Davis, F.D.: Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quart. 13(3), 319–340 (1989)
Desai, S., Sharma, T., Saha, P.: Using ChatGPT in HCI research-a trioethnography. In: Proceedings of the 5th International Conference on Conversational User Interfaces (CUI 2023). Association for Computing Machinery, New York (2023). https://doi.org/10.1145/3571884.3603755
Deshpande, A., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K.: Toxicity in ChatGPT: analyzing persona-assigned language models. arXiv (2023)
Diederich, S., Brendel, A.B., Morana, S., Kolbe, L.: On the design of and interaction with conversational agents: an organizing and assessing review of human-computer interaction research. J. Assoc. Inf. Syst. 23(1), 96–138 (2022)
Eckert, W., Levin, E., Pieraccini, R.: User modeling for spoken dialogue system evaluation. In: 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 80–87 (1997). https://doi.org/10.1109/ASRU.1997.658991
Engelbrecht, K.P., Quade, M., Möller, S.: Analysis of a new simulation approach to dialog system evaluation. Speech Commun. 51(12), 1234–1252 (2009)
Følstad, A., et al.: Future directions for chatbot research: an interdisciplinary research agenda. Computing 103(12), 2915–2942 (2021)
Følstad, A., Brandtzaeg, P.B.: Users’ experiences with chatbots: findings from a questionnaire study. Qual. User Exp. 5(1), 3 (2020)
Fuchs, A., Passarella, A., Conti, M.: Modeling, replicating, and predicting human behavior: a survey. ACM Trans. Autonom. Adapt. Syst. 18(2), 1–47 (2023). https://doi.org/10.1145/3580492
Gatt, A., Krahmer, E.: Survey of the state of the art in natural language generation: core tasks, applications and evaluation. J. Artif. Intell. Res. 61, 65–170 (2018)
Goes, F., Sawicki, P., Grześ, M., Brown, D., Volpe, M.: Is GPT-4 good enough to evaluate jokes? In: 14th International Conference for Computational Creativity, Waterloo (2023). https://kar.kent.ac.uk/101552/
Guo, F., Metallinou, A., Khatri, C., Raju, A., Venkatesh, A., Ram, A.: Topic-based evaluation for conversational bots. In: Proceedings of the Conversational AI Workshop at the 31st Conference on Neural Information Processing Systems (NIPS 2017) (2017)
Hassenzahl, M., Tractinsky, N.: User experience–a research agenda. Behav. Inf. Technol. 25(2), 91–97 (2006). https://doi.org/10.1080/01449290500330331
Holmes, S., Moorhead, A., Bond, R., Zheng, H., Coates, V., McTear, M.: Usability testing of a healthcare chatbot: can we use conventional methods to assess conversational user interfaces? In: Proceedings of the 31st European Conference on Cognitive Ergonomics (ECCE 2019), pp. 207–214. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3335082.3335094
Horton, J.J.: Large language models as simulated economic agents: what can we learn from homo silicus? Working Paper 31122, National Bureau of Economic Research (2023). https://doi.org/10.3386/w31122
Janssen, A., Grützner, L., Breitner, M.H.: Why do chatbots fail? a critical success factors analysis. In: International Conference on Information Systems (ICIS) (2021)
Keizer, S., Rossignol, S., Chandramohan, S., Pietquin, O.: User Simulation in the Development of Statistical Spoken Dialogue Systems, pp. 39–73. Springer, New York (2012). https://doi.org/10.1007/978-1-4614-4803-7_4
Kicken, M., van der Lee, C., Tenfelde, K., Maat, B., de Wit, J.: Introducing a framework for designing and evaluating interactions with conversational agents. In: Position Paper Presented at CONVERSATIONS 2022 – The 6th International Workshop on Chatbot Research and Design (2022)
Kocaballi, A.B.: Conversational AI-powered design: ChatGPT as designer, user, and product. arXiv (2023)
Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems, vol. 35, pp. 22199–22213. Curran Associates, Inc. (2022)
Langevin, R., Lordon, R.J., Avrahami, T., Cowan, B.R., Hirsch, T., Hsieh, G.: Heuristic evaluation of conversational agents. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI 2021). Association for Computing Machinery, New York (2021). https://doi.org/10.1145/3411764.3445312
van der Lee, C., Gatt, A., van Miltenburg, E., Krahmer, E.: Human evaluation of automatically generated text: current trends and best practice guidelines. Comput. Speech Lang. 67, 101151 (2021)
Lewandowski, T., Heuer, M., Vogel, P., Böhmann, T.: Design knowledge for the lifecycle management of conversational agents. In: Wirtschaftsinformatik 2022 Proceedings. No. 3 (2022)
Li, X., Lipton, Z.C., Dhingra, B., Li, L., Gao, J., Chen, Y.N.: A user simulator for task-completion dialogues. arXiv (2017)
Li, Z., Chen, W., Li, S., Wang, H., Qian, J., Yan, X.: Controllable dialogue simulation with in-context learning. arXiv (2023)
Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, Proceedings of the ACL-04 Workshop, pp. 74–81 (2004)
Liu, C.W., Lowe, R., Serban, I.V., Noseworthy, M., Charlin, L., Pineau, J.: How NOT to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv (2017)
Liu, H., Cai, Y., Ou, Z., Huang, Y., Feng, J.: A generative user simulator with GPT-based architecture and goal state tracking for reinforced multi-domain dialog systems. arXiv (2022)
Liu, Y., et al.: One cannot stand for everyone! leveraging multiple user simulators to train task-oriented dialogue systems. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1–21. Association for Computational Linguistics, Toronto (2023). https://doi.org/10.18653/v1/2023.acl-long.1
Lowe, R., Noseworthy, M., Serban, I.V., Angelard-Gontier, N., Bengio, Y., Pineau, J.: Towards an automatic turing test: learning to evaluate dialogue responses. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics (2017)
McTear, M.: Conversational AI: Dialogue Systems, Conversational Agents, and Chatbots. Springer, Cham (2022)
Meyer, S., Elsweiler, D., Ludwig, B., Fernandez-Pichel, M., Losada, D.E.: Do we still need human assessors? prompt-based GPT-3 user simulation in conversational AI. In: Proceedings of the 4th Conference on Conversational User Interfaces (CUI 2022). Association for Computing Machinery, New York (2022). https://doi.org/10.1145/3543829.3544529
Möller, S., et al.: MeMo: Towards automatic usability evaluation of spoken dialogue services by user error simulations. In: INTERSPEECH (2006)
Murad, C., Munteanu, C., Cowan, B.R., Clark, L.: Revolution or evolution? speech interaction and HCI design guidelines. IEEE Pervas. Comput. 18(2), 33–45 (2019). https://doi.org/10.1109/MPRV.2019.2906991
Neustaedter, C., Sengers, P.: Autobiographical design in HCI research: Designing and learning through use-it-yourself. In: Proceedings of the Designing Interactive Systems Conference (DIS 2012), pp. 514–523. Association for Computing Machinery, New York (2012). https://doi.org/10.1145/2317956.2318034
Nielsen, J.: Usability inspection methods. In: Conference Companion on Human Factors in Computing Systems (CHI 1994), pp. 413–414. Association for Computing Machinery, New York (1994). https://doi.org/10.1145/259963.260531
Paoli, S.D.: Writing user personas with large language models: testing phase 6 of a thematic analysis of semi-structured interviews. arXiv (2023)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
Radziwill, N., Benton, M.: Evaluating quality of chatbots and intelligent conversational agents. Softw. Qual. Profess. 19(3), 25 (2017)
Sadek, M., Calvo, R.A., Mougenot, C.: Trends, challenges and processes in conversational agent design: exploring practitioners’ views through semi-structured interviews. In: Proceedings of the 5th International Conference on Conversational User Interfaces (CUI 2023). Association for Computing Machinery, New York (2023). https://doi.org/10.1145/3571884.3597143
Sambasivan, N., Arnesen, E., Hutchinson, B., Doshi, T., Prabhakaran, V.: Re-imagining algorithmic fairness in India and beyond. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT 2021), pp. 315–328. Association for Computing Machinery, New York (2021). https://doi.org/10.1145/3442188.3445896
Schatzmann, J., Weilhammer, K., Stuttle, M., Young, S.: A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. Knowl. Eng. Rev. 21(2), 97–126 (2006). https://doi.org/10.1017/S0269888906000944
Silva, G.R.S., Canedo, E.D.: Towards user-centric guidelines for chatbot conversational design. Int. J. Hum.-Comput. Interact. (2022). https://doi.org/10.1080/10447318.2022.2118244
Sugisaki, K., Bleiker, A.: Usability guidelines and evaluation criteria for conversational user interfaces: a heuristic and linguistic approach. In: Proceedings of Mensch Und Computer 2020 (MuC 2020), pp. 309–319. Association for Computing Machinery, New York (2020). https://doi.org/10.1145/3404983.3405505
Tao, C., Mou, L., Zhao, D., Yan, R.: Ruber: an unsupervised method for automatic evaluation of open-domain dialog systems. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Urban, M., Mailey, S.: Conversation design: principles, strategies, and practical application. In: Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems (CHI EA 2019), pp. 1–3. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3290607.3298821
Vasconcelos, M., Candello, H., Pinhanez, C., dos Santos, T.: Bottester: testing conversational systems with simulated users. In: Proceedings of the XVI Brazilian Symposium on Human Factors in Computing Systems (IHC 2017). Association for Computing Machinery, New York (2017). https://doi.org/10.1145/3160504.3160584
White, J., et al.: A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv (2023)
Wilson, C.: User interface inspection methods: a user-centered design method. Newnes (2013)
Wilson, C.E.: Triangulation: the explicit use of multiple methods, measures, and approaches for determining core issues in product development. Interactions 13(6), 46-ff (2006). https://doi.org/10.1145/1167948.1167980
de Wit, J., Braggaar, A.: Tilbot: a visual design platform to facilitate open science research into conversational user interfaces. In: Proceedings of the 5th International Conference on Conversational User Interfaces (CUI 2023). Association for Computing Machinery, New York (2023). https://doi.org/10.1145/3571884.3604403

Acknowledgments

I am sincerely grateful to Serkan Girgin (University of Twente) for helping to sprout the idea for this exploration, the eScience Center for their continuous support with developing Tilbot, and the funded WeCare project with the Elisabeth-TweeSteden hospital and the Heracleum Fund for supporting our studies into the use of conversational agents in medical practice. Finally, I greatly appreciate the reviewers’ suggestions based on the initial version of this paper, and the valuable questions and suggestions from CONVERSATIONS workshop attendees.

Author information

Authors and Affiliations

Department of Communication and Cognition, Tilburg University, Warandelaan 2, 5037 AB, Tilburg, The Netherlands
Jan de Wit

Authors

Jan de Wit

Corresponding author

Correspondence to Jan de Wit.

Editor information

Editors and Affiliations

SINTEF, Oslo, Norway
Asbjørn Følstad
University of Amsterdam, Amsterdam, The Netherlands
Theo Araujo
CERTH-ITI, Thessaloniki, Greece
Symeon Papadopoulos
Department of Computer Science, Durham University, Durham, UK
Effie L.-C. Law
Design Informatics, University of Edinburgh, Edinburgh, UK
Ewa Luger
Centre for AI Research, University of Agder, Grimstad, Norway
Morten Goodwin
TH Lübeck – University of Applied Sciences, Lübeck, Germany
Sebastian Hobert
SINTEF & University of Oslo, Oslo, Norway
Petter Bae Brandtzaeg

About this paper

Cite this paper

de Wit, J. (2024). Leveraging Large Language Models as Simulated Users for Initial, Low-Cost Evaluations of Designed Conversations. In: Følstad, A., et al. Chatbot Research and Design. CONVERSATIONS 2023. Lecture Notes in Computer Science, vol 14524. Springer, Cham. https://doi.org/10.1007/978-3-031-54975-5_5

DOI: https://doi.org/10.1007/978-3-031-54975-5_5
Published: 13 March 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-54974-8
Online ISBN: 978-3-031-54975-5
eBook Packages: Computer Science, Computer Science (R0)
