Abstract
In this paper, we explore the use of large language models, in this case the ChatGPT API, as simulated users to evaluate designed, rule-based conversations. This type of evaluation can be introduced as a low-cost method to identify common usability issues prior to testing conversational agents with actual users. Preliminary findings show that ChatGPT is good at playing the part of a user, providing realistic testing scenarios for designed conversations even if these involve certain background knowledge or context. GPT-4 shows vast improvements over ChatGPT (3.5). In future work, it is important to evaluate the performance of simulated users in a more structured, generalizable manner, for example by comparing their behavior to that of actual users. In addition, ways to fine-tune the LLM could be explored to improve its performance, and the output of simulated conversations could be analyzed to automatically derive usability metrics such as the number of turns needed to reach the goal. Finally, the use of simulated users with open-ended conversational agents could be explored, where the LLM may also be able to reflect on the user experience of the conversation.
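The evaluation loop described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the paper: the flow `FLOW`, the helpers `match_option` and `simulate`, and the stand-in `scripted_user` are all hypothetical names introduced here. In practice `scripted_user` would be replaced by a call to an LLM that is prompted to play the user. The sketch runs a simulated user against a rule-based, multiple-choice conversation and reports the number of user turns needed to reach the goal, one of the usability metrics suggested as future work.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical rule-based conversation: each state holds a bot message and
# a mapping from answer option to the next state (None ends the dialogue).
Flow = Dict[str, Tuple[str, Dict[str, Optional[str]]]]

FLOW: Flow = {
    "start": ("Do you want to book an appointment?", {"Yes": "day", "No": None}),
    "day": ("Which day suits you?", {"Monday": "done", "Friday": "done"}),
    "done": ("Your appointment is booked.", {}),
}


def match_option(reply: str, options: List[str]) -> Optional[str]:
    """Map a free-text reply onto the first option it contains as a whole word."""
    cleaned = reply.lower()
    for option in options:
        if re.search(r"\b" + re.escape(option.lower()) + r"\b", cleaned):
            return option
    return None


def simulate(flow: Flow,
             simulated_user: Callable[[str, List[str]], str],
             goal_state: str,
             max_turns: int = 10) -> dict:
    """Run one simulated conversation and count user turns until the goal."""
    state: Optional[str] = "start"
    transcript: List[Tuple[str, str]] = []
    turns = 0
    while state is not None and turns < max_turns:
        message, options = flow[state]
        transcript.append(("bot", message))
        if state == goal_state:
            return {"reached_goal": True, "turns": turns, "transcript": transcript}
        if not options:
            break  # dead end that is not the goal
        reply = simulated_user(message, list(options))
        transcript.append(("user", reply))
        turns += 1
        choice = match_option(reply, list(options))
        if choice is not None:
            state = options[choice]
        # If no option matched, the bot repeats its question on the next pass.
    return {"reached_goal": False, "turns": turns, "transcript": transcript}


def scripted_user(message: str, options: List[str]) -> str:
    """Stand-in for an LLM call (assumption): picks the first option and wraps
    it in extra text, so the reply still has to be mapped back to an option."""
    return f"Sure, {options[0]} please."


if __name__ == "__main__":
    result = simulate(FLOW, scripted_user, goal_state="done")
    print(result["reached_goal"], result["turns"])
```

Matching replies on whole words rather than requiring an exact string keeps the loop usable when the simulated user adds text around its chosen option. Conversations that hit `max_turns` without reaching the goal state are natural candidates for closer inspection.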
Notes
- 1. See, for example, Botanalytics (https://botanalytics.co).
- 2.
- 3.
- 4. GPT-4 appears to be much better than GPT-3.5 at following multiple choice instructions: It chooses an answer option and returns only that answer option. GPT-3.5 sometimes adds text, or rephrases the answer option.
- 5. For a discussion of the value of this type of autoethnographic work, please see [49].
- 6.
References
Afzali, J., Drzewiecki, A.M., Balog, K., Zhang, S.: UserSimCRS: a user simulation toolkit for evaluating conversational recommender systems. In: Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining (WSDM 2023), pp. 1160–1163. Association for Computing Machinery, New York (2023). https://doi.org/10.1145/3539597.3573029
Akbar, S., Coiera, E., Magrabi, F.: Safety concerns with consumer-facing mobile health applications and their consequences: a scoping review. J. Am. Med. Inform. Assoc. 27(2), 330–340 (2019). https://doi.org/10.1093/jamia/ocz175
Allouch, M., Azaria, A., Azoulay, R.: Conversational agents: goals, technologies, vision and challenges. Sensors 21(24), 8448 (2021). https://doi.org/10.3390/s21248448
Argyle, L.P., Busby, E.C., Fulda, N., Gubler, J.R., Rytting, C., Wingate, D.: Out of one, many: using language models to simulate human samples. Polit. Anal. 31(3), 337–351 (2023). https://doi.org/10.1017/pan.2023.2
Bell, G., Blythe, M., Sengers, P.: Making by making strange: defamiliarization and the design of domestic technologies. ACM Trans. Comput. Hum. Interact. 12(2), 149–173 (2005). https://doi.org/10.1145/1067860.1067862
Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S.: On the dangers of stochastic parrots: can language models be too big? In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT 2021), pp. 610–623. Association for Computing Machinery, New York (2021). https://doi.org/10.1145/3442188.3445922
Blythe, M., Buie, E.: Chatbots of the gods: imaginary abstracts for techno-spirituality research. In: Proceedings of the 8th Nordic Conference on Human-Computer Interaction: Fun, Fast, Foundational (NordiCHI 2014), pp. 227–236. Association for Computing Machinery, New York (2014). https://doi.org/10.1145/2639189.2641212
Bozic, J., Tazl, O.A., Wotawa, F.: Chatbot testing using AI planning. In: 2019 IEEE International Conference On Artificial Intelligence Testing (AITest), pp. 37–44 (2019). https://doi.org/10.1109/AITest.2019.00-10
Bravo-Santos, S., Guerra, E., de Lara, J.: Testing chatbots with charm. In: Shepperd, M., Brito e Abreu, F., Rodrigues da Silva, A., Pérez-Castillo, R. (eds.) Quality of Information and Communications Technology: 13th International Conference, QUATIC 2020, Faro, Portugal, September 9–11, 2020, Proceedings, pp. 426–438. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58793-2_34
Cameron, G., et al.: Back to the future: lessons from knowledge engineering methodologies for chatbot design and development. In: British HCI Conference 2018. BCS Learning & Development Ltd. (2018)
Choi, Y., Monserrat, T.J.K.P., Park, J., Shin, H., Lee, N., Kim, J.: ProtoChat: supporting the conversation design process with crowd feedback. Proc. ACM Hum. Comput. Interact. 4(CSCW3), 1–27 (2021). https://doi.org/10.1145/3432924
Cockton, G., Woolrych, A.: Sale must end: should discount methods be cleared off HCI’s shelves? Interactions 9(5), 13–18 (2002). https://doi.org/10.1145/566981.566990
Cowan, B.R., Clark, L., Candello, H., Tsai, J.: Introduction to this special issue: guiding the conversation: new theory and design perspectives for conversational user interfaces. Hum. Comput. Interact. 38(3–4), 159–167 (2023). https://doi.org/10.1080/07370024.2022.2161905
Dall’Acqua, A., Tamburini, F.: Toward a linguistically grounded dialog model for chatbot design. Italian J. Comput. Linguist. 7(7–1, 2), 191–222 (2021)
Davis, F.D.: Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quart. 13(3), 319–340 (1989)
Desai, S., Sharma, T., Saha, P.: Using ChatGPT in HCI research-a trioethnography. In: Proceedings of the 5th International Conference on Conversational User Interfaces (CUI 2023). Association for Computing Machinery, New York (2023). https://doi.org/10.1145/3571884.3603755
Deshpande, A., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K.: Toxicity in ChatGPT: analyzing persona-assigned language models. arXiv (2023)
Diederich, S., Brendel, A.B., Morana, S., Kolbe, L.: On the design of and interaction with conversational agents: an organizing and assessing review of human-computer interaction research. J. Assoc. Inf. Syst. 23(1), 96–138 (2022)
Eckert, W., Levin, E., Pieraccini, R.: User modeling for spoken dialogue system evaluation. In: 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 80–87 (1997). https://doi.org/10.1109/ASRU.1997.658991
Engelbrecht, K.P., Quade, M., Möller, S.: Analysis of a new simulation approach to dialog system evaluation. Speech Commun. 51(12), 1234–1252 (2009)
Følstad, A., et al.: Future directions for chatbot research: an interdisciplinary research agenda. Computing 103(12), 2915–2942 (2021)
Følstad, A., Brandtzaeg, P.B.: Users’ experiences with chatbots: findings from a questionnaire study. Qual. User Exp. 5(1), 3 (2020)
Fuchs, A., Passarella, A., Conti, M.: Modeling, replicating, and predicting human behavior: a survey. ACM Trans. Autonom. Adapt. Syst. 18(2), 1–47 (2023). https://doi.org/10.1145/3580492
Gatt, A., Krahmer, E.: Survey of the state of the art in natural language generation: core tasks, applications and evaluation. J. Artif. Intell. Res. 61, 65–170 (2018)
Goes, F., Sawicki, P., Grześ, M., Brown, D., Volpe, M.: Is GPT-4 good enough to evaluate jokes? In: 14th International Conference for Computational Creativity, Waterloo (2023). https://kar.kent.ac.uk/101552/
Guo, F., Metallinou, A., Khatri, C., Raju, A., Venkatesh, A., Ram, A.: Topic-based evaluation for conversational bots. In: Proceedings of the Conversational AI Workshop at the 31st Conference on Neural Information Processing Systems (NIPS 2017) (2017)
Hassenzahl, M., Tractinsky, N.: User experience–a research agenda. Behav. Inf. Technol. 25(2), 91–97 (2006). https://doi.org/10.1080/01449290500330331
Holmes, S., Moorhead, A., Bond, R., Zheng, H., Coates, V., McTear, M.: Usability testing of a healthcare chatbot: can we use conventional methods to assess conversational user interfaces? In: Proceedings of the 31st European Conference on Cognitive Ergonomics (ECCE 2019), pp. 207–214. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3335082.3335094
Horton, J.J.: Large language models as simulated economic agents: what can we learn from homo silicus? Working Paper 31122, National Bureau of Economic Research (2023). https://doi.org/10.3386/w31122
Janssen, A., Grützner, L., Breitner, M.H.: Why do chatbots fail? a critical success factors analysis. In: International Conference on Information Systems (ICIS) (2021)
Keizer, S., Rossignol, S., Chandramohan, S., Pietquin, O.: User Simulation in the Development of Statistical Spoken Dialogue Systems, pp. 39–73. Springer, New York (2012). https://doi.org/10.1007/978-1-4614-4803-7_4
Kicken, M., van der Lee, C., Tenfelde, K., Maat, B., de Wit, J.: Introducing a framework for designing and evaluating interactions with conversational agents. In: Position Paper Presented at CONVERSATIONS 2022 – The 6th International Workshop on Chatbot Research and Design (2022)
Kocaballi, A.B.: Conversational AI-powered design: ChatGPT as designer, user, and product. arXiv (2023)
Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems, vol. 35, pp. 22199–22213. Curran Associates, Inc. (2022)
Langevin, R., Lordon, R.J., Avrahami, T., Cowan, B.R., Hirsch, T., Hsieh, G.: Heuristic evaluation of conversational agents. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI 2021). Association for Computing Machinery, New York (2021). https://doi.org/10.1145/3411764.3445312
van der Lee, C., Gatt, A., van Miltenburg, E., Krahmer, E.: Human evaluation of automatically generated text: current trends and best practice guidelines. Comput. Speech Lang. 67, 101151 (2021)
Lewandowski, T., Heuer, M., Vogel, P., Böhmann, T.: Design knowledge for the lifecycle management of conversational agents. In: Wirtschaftsinformatik 2022 Proceedings. No. 3 (2022)
Li, X., Lipton, Z.C., Dhingra, B., Li, L., Gao, J., Chen, Y.N.: A user simulator for task-completion dialogues. arXiv (2017)
Li, Z., Chen, W., Li, S., Wang, H., Qian, J., Yan, X.: Controllable dialogue simulation with in-context learning. arXiv (2023)
Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, Proceedings of the ACL-04 Workshop, pp. 74–81 (2004)
Liu, C.W., Lowe, R., Serban, I.V., Noseworthy, M., Charlin, L., Pineau, J.: How NOT to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv (2017)
Liu, H., Cai, Y., Ou, Z., Huang, Y., Feng, J.: A generative user simulator with GPT-based architecture and goal state tracking for reinforced multi-domain dialog systems. arXiv (2022)
Liu, Y., et al.: One cannot stand for everyone! leveraging multiple user simulators to train task-oriented dialogue systems. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1–21. Association for Computational Linguistics, Toronto (2023). https://doi.org/10.18653/v1/2023.acl-long.1
Lowe, R., Noseworthy, M., Serban, I.V., Angelard-Gontier, N., Bengio, Y., Pineau, J.: Towards an automatic Turing test: learning to evaluate dialogue responses. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics (2017)
McTear, M.: Conversational AI: Dialogue Systems, Conversational Agents, and Chatbots. Springer, Cham (2022)
Meyer, S., Elsweiler, D., Ludwig, B., Fernandez-Pichel, M., Losada, D.E.: Do we still need human assessors? prompt-based GPT-3 user simulation in conversational AI. In: Proceedings of the 4th Conference on Conversational User Interfaces (CUI 2022). Association for Computing Machinery, New York (2022). https://doi.org/10.1145/3543829.3544529
Möller, S., et al.: MeMo: towards automatic usability evaluation of spoken dialogue services by user error simulations. In: INTERSPEECH (2006)
Murad, C., Munteanu, C., Cowan, B.R., Clark, L.: Revolution or evolution? speech interaction and HCI design guidelines. IEEE Pervas. Comput. 18(2), 33–45 (2019). https://doi.org/10.1109/MPRV.2019.2906991
Neustaedter, C., Sengers, P.: Autobiographical design in HCI research: designing and learning through use-it-yourself. In: Proceedings of the Designing Interactive Systems Conference (DIS 2012), pp. 514–523. Association for Computing Machinery, New York (2012). https://doi.org/10.1145/2317956.2318034
Nielsen, J.: Usability inspection methods. In: Conference Companion on Human Factors in Computing Systems (CHI 1994), pp. 413–414. Association for Computing Machinery, New York (1994). https://doi.org/10.1145/259963.260531
Paoli, S.D.: Writing user personas with large language models: testing phase 6 of a thematic analysis of semi-structured interviews. arXiv (2023)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
Radziwill, N., Benton, M.: Evaluating quality of chatbots and intelligent conversational agents. Softw. Qual. Profess. 19(3), 25 (2017)
Sadek, M., Calvo, R.A., Mougenot, C.: Trends, challenges and processes in conversational agent design: exploring practitioners’ views through semi-structured interviews. In: Proceedings of the 5th International Conference on Conversational User Interfaces (CUI 2023). Association for Computing Machinery, New York (2023). https://doi.org/10.1145/3571884.3597143
Sambasivan, N., Arnesen, E., Hutchinson, B., Doshi, T., Prabhakaran, V.: Re-imagining algorithmic fairness in India and beyond. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT 2021), pp. 315–328. Association for Computing Machinery, New York (2021). https://doi.org/10.1145/3442188.3445896
Schatzmann, J., Weilhammer, K., Stuttle, M., Young, S.: A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. Knowl. Eng. Rev. 21(2), 97–126 (2006). https://doi.org/10.1017/S0269888906000944
Silva, G.R.S., Canedo, E.D.: Towards user-centric guidelines for chatbot conversational design. Int. J. Hum.-Comput. Interact. (2022). https://doi.org/10.1080/10447318.2022.2118244
Sugisaki, K., Bleiker, A.: Usability guidelines and evaluation criteria for conversational user interfaces: a heuristic and linguistic approach. In: Proceedings of Mensch Und Computer 2020 (MuC 2020), pp. 309–319. Association for Computing Machinery, New York (2020). https://doi.org/10.1145/3404983.3405505
Tao, C., Mou, L., Zhao, D., Yan, R.: Ruber: an unsupervised method for automatic evaluation of open-domain dialog systems. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Urban, M., Mailey, S.: Conversation design: principles, strategies, and practical application. In: Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems (CHI EA 2019), pp. 1–3. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3290607.3298821
Vasconcelos, M., Candello, H., Pinhanez, C., dos Santos, T.: Bottester: testing conversational systems with simulated users. In: Proceedings of the XVI Brazilian Symposium on Human Factors in Computing Systems (IHC 2017). Association for Computing Machinery, New York (2017). https://doi.org/10.1145/3160504.3160584
White, J., et al.: A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv (2023)
Wilson, C.: User interface inspection methods: a user-centered design method. Newnes (2013)
Wilson, C.E.: Triangulation: the explicit use of multiple methods, measures, and approaches for determining core issues in product development. Interactions 13(6), 46-ff (2006). https://doi.org/10.1145/1167948.1167980
de Wit, J., Braggaar, A.: Tilbot: a visual design platform to facilitate open science research into conversational user interfaces. In: Proceedings of the 5th International Conference on Conversational User Interfaces (CUI 2023). Association for Computing Machinery, New York (2023). https://doi.org/10.1145/3571884.3604403
Acknowledgments
I am sincerely grateful to Serkan Girgin (University of Twente) for helping to sprout the idea for this exploration, the eScience Center for their continuous support with developing Tilbot, and the funded WeCare project with the Elisabeth-TweeSteden hospital and the Heracleum Fund for supporting our studies into the use of conversational agents in medical practice. Finally, I greatly appreciate the reviewers’ suggestions based on the initial version of this paper, and the valuable questions and suggestions from CONVERSATIONS workshop attendees.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
de Wit, J. (2024). Leveraging Large Language Models as Simulated Users for Initial, Low-Cost Evaluations of Designed Conversations. In: Følstad, A., et al. Chatbot Research and Design. CONVERSATIONS 2023. Lecture Notes in Computer Science, vol 14524. Springer, Cham. https://doi.org/10.1007/978-3-031-54975-5_5
DOI: https://doi.org/10.1007/978-3-031-54975-5_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-54974-8
Online ISBN: 978-3-031-54975-5
eBook Packages: Computer Science, Computer Science (R0)