Leveraging Large Language Models as Simulated Users for Initial, Low-Cost Evaluations of Designed Conversations

  • Conference paper
Chatbot Research and Design (CONVERSATIONS 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14524)

Abstract

In this paper, we explore the use of large language models, in this case the ChatGPT API, as simulated users to evaluate designed, rule-based conversations. This type of evaluation can be introduced as a low-cost method to identify common usability issues prior to testing conversational agents with actual users. Preliminary findings show that ChatGPT is good at playing the part of a user, providing realistic testing scenarios for designed conversations even if these involve certain background knowledge or context. GPT-4 shows vast improvements over ChatGPT (3.5). In future work, it is important to evaluate the performance of simulated users in a more structured, generalizable manner, for example by comparing their behavior to that of actual users. In addition, ways to fine-tune the LLM could be explored to improve its performance, and the output of simulated conversations could be analyzed to automatically derive usability metrics such as the number of turns needed to reach the goal. Finally, the use of simulated users with open-ended conversational agents could be explored, where the LLM may also be able to reflect on the user experience of the conversation.
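
As a rough illustration of the kind of automatically derived usability metric mentioned above, the following minimal sketch counts how many user turns a simulated conversation needs before the agent signals that the goal has been reached. The `Turn` structure, the speaker labels, the `[GOAL]` marker, and the example dialogue are all hypothetical and chosen for this sketch; they are not taken from the paper or from Tilbot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    speaker: str  # "agent" or "user" (labels assumed for this sketch)
    text: str


def user_turns_to_goal(transcript: list[Turn], goal_marker: str) -> Optional[int]:
    """Count user turns up to the first agent message containing goal_marker.

    Returns None when the goal marker never appears in an agent message.
    """
    user_turns = 0
    for turn in transcript:
        if turn.speaker == "user":
            user_turns += 1
        elif turn.speaker == "agent" and goal_marker in turn.text:
            return user_turns
    return None


# Hypothetical transcript of a simulated user talking to a rule-based agent.
example = [
    Turn("agent", "Does your character wear glasses?"),
    Turn("user", "No"),
    Turn("agent", "Does your character have a beard?"),
    Turn("user", "Yes"),
    Turn("agent", "[GOAL] I think I know who your character is!"),
]

print(user_turns_to_goal(example, "[GOAL]"))
```

Aggregating this count over many simulated runs would give a coarse efficiency measure; how well it reflects the behavior of actual users is exactly the open question raised in the abstract.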

Notes

  1. See, for example, Botanalytics (https://botanalytics.co).

  2. https://github.com/tilbotio/guesswho.

  3. https://github.com/tilbotio/tilbot-main.

  4. GPT-4 appears to be much better than GPT-3.5 at following multiple-choice instructions: it chooses an answer option and returns only that answer option. GPT-3.5 sometimes adds text or rephrases the answer option.

  5. For a discussion of the value of this type of autoethnographic work, please see [49].

  6. https://osf.io/8p4zn/.

References

  1. Afzali, J., Drzewiecki, A.M., Balog, K., Zhang, S.: UserSimCRS: a user simulation toolkit for evaluating conversational recommender systems. In: Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining (WSDM 2023), pp. 1160–1163. Association for Computing Machinery, New York (2023). https://doi.org/10.1145/3539597.3573029

  2. Akbar, S., Coiera, E., Magrabi, F.: Safety concerns with consumer-facing mobile health applications and their consequences: a scoping review. J. Am. Med. Inform. Assoc. 27(2), 330–340 (2019). https://doi.org/10.1093/jamia/ocz175

  3. Allouch, M., Azaria, A., Azoulay, R.: Conversational agents: goals, technologies, vision and challenges. Sensors 21(24), 8448 (2021). https://doi.org/10.3390/s21248448

  4. Argyle, L.P., Busby, E.C., Fulda, N., Gubler, J.R., Rytting, C., Wingate, D.: Out of one, many: using language models to simulate human samples. Polit. Anal. 31(3), 337–351 (2023). https://doi.org/10.1017/pan.2023.2

  5. Bell, G., Blythe, M., Sengers, P.: Making by making strange: defamiliarization and the design of domestic technologies. ACM Trans. Comput. Hum. Interact. 12(2), 149–173 (2005). https://doi.org/10.1145/1067860.1067862

  6. Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S.: On the dangers of stochastic parrots: can language models be too big? In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT 2021), pp. 610–623. Association for Computing Machinery, New York (2021). https://doi.org/10.1145/3442188.3445922

  7. Blythe, M., Buie, E.: Chatbots of the gods: imaginary abstracts for techno-spirituality research. In: Proceedings of the 8th Nordic Conference on Human-Computer Interaction: Fun, Fast, Foundational (NordiCHI 2014), pp. 227–236. Association for Computing Machinery, New York (2014). https://doi.org/10.1145/2639189.2641212

  8. Bozic, J., Tazl, O.A., Wotawa, F.: Chatbot testing using AI planning. In: 2019 IEEE International Conference On Artificial Intelligence Testing (AITest), pp. 37–44 (2019). https://doi.org/10.1109/AITest.2019.00-10

  9. Bravo-Santos, S., Guerra, E., de Lara, J.: Testing chatbots with charm. In: Shepperd, M., Brito e Abreu, F., Rodrigues da Silva, A., Pérez-Castillo, R. (eds.) Quality of Information and Communications Technology: 13th International Conference, QUATIC 2020, Faro, Portugal, September 9–11, 2020, Proceedings, pp. 426–438. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58793-2_34

  10. Cameron, G., et al.: Back to the future: lessons from knowledge engineering methodologies for chatbot design and development. In: British HCI Conference 2018. BCS Learning & Development Ltd. (2018)

  11. Choi, Y., Monserrat, T.J.K.P., Park, J., Shin, H., Lee, N., Kim, J.: ProtoChat: supporting the conversation design process with crowd feedback. Proc. ACM Hum. Comput. Interact. 4(CSCW3), 1–27 (2021). https://doi.org/10.1145/3432924

  12. Cockton, G., Woolrych, A.: Sale must end: Should discount methods be cleared off HCI’s shelves? Interactions 9(5), 13–18 (2002). https://doi.org/10.1145/566981.566990

  13. Cowan, B.R., Clark, L., Candello, H., Tsai, J.: Introduction to this special issue: guiding the conversation: new theory and design perspectives for conversational user interfaces. Hum. Comput. Interact. 38(3–4), 159–167 (2023). https://doi.org/10.1080/07370024.2022.2161905

  14. Dall’Acqua, A., Tamburini, F.: Toward a linguistically grounded dialog model for chatbot design. Italian J. Comput. Linguist. 7(7–1, 2), 191–222 (2021)

  15. Davis, F.D.: Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quart. 13(3), 319–340 (1989)

  16. Desai, S., Sharma, T., Saha, P.: Using ChatGPT in HCI research-a trioethnography. In: Proceedings of the 5th International Conference on Conversational User Interfaces (CUI 2023). Association for Computing Machinery, New York (2023). https://doi.org/10.1145/3571884.3603755

  17. Deshpande, A., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K.: Toxicity in ChatGPT: analyzing persona-assigned language models. arXiv (2023)

  18. Diederich, S., Brendel, A.B., Morana, S., Kolbe, L.: On the design of and interaction with conversational agents: an organizing and assessing review of human-computer interaction research. J. Assoc. Inf. Syst. 23(1), 96–138 (2022)

  19. Eckert, W., Levin, E., Pieraccini, R.: User modeling for spoken dialogue system evaluation. In: 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 80–87 (1997). https://doi.org/10.1109/ASRU.1997.658991

  20. Engelbrecht, K.P., Quade, M., Möller, S.: Analysis of a new simulation approach to dialog system evaluation. Speech Commun. 51(12), 1234–1252 (2009)

  21. Følstad, A., et al.: Future directions for chatbot research: an interdisciplinary research agenda. Computing 103(12), 2915–2942 (2021)

  22. Følstad, A., Brandtzaeg, P.B.: Users’ experiences with chatbots: findings from a questionnaire study. Qual. User Exp. 5(1), 3 (2020)

  23. Fuchs, A., Passarella, A., Conti, M.: Modeling, replicating, and predicting human behavior: a survey. ACM Trans. Autonom. Adapt. Syst. 18(2), 1–47 (2023). https://doi.org/10.1145/3580492

  24. Gatt, A., Krahmer, E.: Survey of the state of the art in natural language generation: core tasks, applications and evaluation. J. Artif. Intell. Res. 61, 65–170 (2018)

  25. Goes, F., Sawicki, P., Grześ, M., Brown, D., Volpe, M.: Is GPT-4 good enough to evaluate jokes? In: 14th International Conference for Computational Creativity, Waterloo (2023). https://kar.kent.ac.uk/101552/

  26. Guo, F., Metallinou, A., Khatri, C., Raju, A., Venkatesh, A., Ram, A.: Topic-based evaluation for conversational bots. In: Proceedings of the Conversational AI Workshop at the 31st Conference on Neural Information Processing Systems (NIPS 2017) (2017)

  27. Hassenzahl, M., Tractinsky, N.: User experience–a research agenda. Behav. Inf. Technol. 25(2), 91–97 (2006). https://doi.org/10.1080/01449290500330331

  28. Holmes, S., Moorhead, A., Bond, R., Zheng, H., Coates, V., McTear, M.: Usability testing of a healthcare chatbot: can we use conventional methods to assess conversational user interfaces? In: Proceedings of the 31st European Conference on Cognitive Ergonomics (ECCE 2019), pp. 207–214. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3335082.3335094

  29. Horton, J.J.: Large language models as simulated economic agents: what can we learn from homo silicus? Working Paper 31122, National Bureau of Economic Research (2023). https://doi.org/10.3386/w31122

  30. Janssen, A., Grützner, L., Breitner, M.H.: Why do chatbots fail? a critical success factors analysis. In: International Conference on Information Systems (ICIS) (2021)

  31. Keizer, S., Rossignol, S., Chandramohan, S., Pietquin, O.: User Simulation in the Development of Statistical Spoken Dialogue Systems, pp. 39–73. Springer, New York (2012). https://doi.org/10.1007/978-1-4614-4803-7_4

  32. Kicken, M., van der Lee, C., Tenfelde, K., Maat, B., de Wit, J.: Introducing a framework for designing and evaluating interactions with conversational agents. In: Position Paper Presented at CONVERSATIONS 2022 – The 6th International Workshop on Chatbot Research and Design (2022)

  33. Kocaballi, A.B.: Conversational AI-powered design: ChatGPT as designer, user, and product. arXiv (2023)

  34. Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems, vol. 35, pp. 22199–22213. Curran Associates, Inc. (2022)

  35. Langevin, R., Lordon, R.J., Avrahami, T., Cowan, B.R., Hirsch, T., Hsieh, G.: Heuristic evaluation of conversational agents. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI 2021). Association for Computing Machinery, New York (2021). https://doi.org/10.1145/3411764.3445312

  36. van der Lee, C., Gatt, A., van Miltenburg, E., Krahmer, E.: Human evaluation of automatically generated text: current trends and best practice guidelines. Comput. Speech Lang. 67, 101151 (2021)

  37. Lewandowski, T., Heuer, M., Vogel, P., Böhmann, T.: Design knowledge for the lifecycle management of conversational agents. In: Wirtschaftsinformatik 2022 Proceedings. No. 3 (2022)

  38. Li, X., Lipton, Z.C., Dhingra, B., Li, L., Gao, J., Chen, Y.N.: A user simulator for task-completion dialogues. arXiv (2017)

  39. Li, Z., Chen, W., Li, S., Wang, H., Qian, J., Yan, X.: Controllable dialogue simulation with in-context learning. arXiv (2023)

  40. Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, Proceedings of the ACL-04 Workshop, pp. 74–81 (2004)

  41. Liu, C.W., Lowe, R., Serban, I.V., Noseworthy, M., Charlin, L., Pineau, J.: How NOT to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv (2017)

  42. Liu, H., Cai, Y., Ou, Z., Huang, Y., Feng, J.: A generative user simulator with GPT-based architecture and goal state tracking for reinforced multi-domain dialog systems. arXiv (2022)

  43. Liu, Y., et al.: One cannot stand for everyone! leveraging multiple user simulators to train task-oriented dialogue systems. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1–21. Association for Computational Linguistics, Toronto (2023). https://doi.org/10.18653/v1/2023.acl-long.1

  44. Lowe, R., Noseworthy, M., Serban, I.V., Angelard-Gontier, N., Bengio, Y., Pineau, J.: Towards an automatic turing test: learning to evaluate dialogue responses. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics (2017)

  45. McTear, M.: Conversational AI: Dialogue Systems, Conversational Agents, and Chatbots. Springer, Cham (2022)

  46. Meyer, S., Elsweiler, D., Ludwig, B., Fernandez-Pichel, M., Losada, D.E.: Do we still need human assessors? prompt-based GPT-3 user simulation in conversational AI. In: Proceedings of the 4th Conference on Conversational User Interfaces (CUI 2022). Association for Computing Machinery, New York (2022). https://doi.org/10.1145/3543829.3544529

  47. Möller, S., et al.: MeMo: Towards automatic usability evaluation of spoken dialogue services by user error simulations. In: INTERSPEECH (2006)

  48. Murad, C., Munteanu, C., Cowan, B.R., Clark, L.: Revolution or evolution? speech interaction and HCI design guidelines. IEEE Pervas. Comput. 18(2), 33–45 (2019). https://doi.org/10.1109/MPRV.2019.2906991

  49. Neustaedter, C., Sengers, P.: Autobiographical design in HCI research: Designing and learning through use-it-yourself. In: Proceedings of the Designing Interactive Systems Conference (DIS 2012), pp. 514–523. Association for Computing Machinery, New York (2012). https://doi.org/10.1145/2317956.2318034

  50. Nielsen, J.: Usability inspection methods. In: Conference Companion on Human Factors in Computing Systems (CHI 1994), pp. 413–414. Association for Computing Machinery, New York (1994). https://doi.org/10.1145/259963.260531

  51. Paoli, S.D.: Writing user personas with large language models: testing phase 6 of a thematic analysis of semi-structured interviews. arXiv (2023)

  52. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)

  53. Radziwill, N., Benton, M.: Evaluating quality of chatbots and intelligent conversational agents. Softw. Qual. Profess. 19(3), 25 (2017)

  54. Sadek, M., Calvo, R.A., Mougenot, C.: Trends, challenges and processes in conversational agent design: exploring practitioners’ views through semi-structured interviews. In: Proceedings of the 5th International Conference on Conversational User Interfaces (CUI 2023). Association for Computing Machinery, New York (2023). https://doi.org/10.1145/3571884.3597143

  55. Sambasivan, N., Arnesen, E., Hutchinson, B., Doshi, T., Prabhakaran, V.: Re-imagining algorithmic fairness in India and beyond. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT 2021), pp. 315–328. Association for Computing Machinery, New York (2021). https://doi.org/10.1145/3442188.3445896

  56. Schatzmann, J., Weilhammer, K., Stuttle, M., Young, S.: A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. Knowl. Eng. Rev. 21(2), 97–126 (2006). https://doi.org/10.1017/S0269888906000944

  57. Silva, G.R.S., Canedo, E.D.: Towards user-centric guidelines for chatbot conversational design. Int. J. Hum.-Comput. Interact. (2022). https://doi.org/10.1080/10447318.2022.2118244

  58. Sugisaki, K., Bleiker, A.: Usability guidelines and evaluation criteria for conversational user interfaces: a heuristic and linguistic approach. In: Proceedings of Mensch Und Computer 2020 (MuC 2020), pp. 309–319. Association for Computing Machinery, New York (2020). https://doi.org/10.1145/3404983.3405505

  59. Tao, C., Mou, L., Zhao, D., Yan, R.: Ruber: an unsupervised method for automatic evaluation of open-domain dialog systems. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)

  60. Urban, M., Mailey, S.: Conversation design: principles, strategies, and practical application. In: Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems (CHI EA 2019), pp. 1–3. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3290607.3298821

  61. Vasconcelos, M., Candello, H., Pinhanez, C., dos Santos, T.: Bottester: testing conversational systems with simulated users. In: Proceedings of the XVI Brazilian Symposium on Human Factors in Computing Systems (IHC 2017). Association for Computing Machinery, New York (2017). https://doi.org/10.1145/3160504.3160584

  62. White, J., et al.: A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv (2023)

  63. Wilson, C.: User interface inspection methods: a user-centered design method. Newnes (2013)

  64. Wilson, C.E.: Triangulation: the explicit use of multiple methods, measures, and approaches for determining core issues in product development. Interactions 13(6), 46-ff (2006). https://doi.org/10.1145/1167948.1167980

  65. de Wit, J., Braggaar, A.: Tilbot: a visual design platform to facilitate open science research into conversational user interfaces. In: Proceedings of the 5th International Conference on Conversational User Interfaces (CUI 2023). Association for Computing Machinery, New York (2023). https://doi.org/10.1145/3571884.3604403

Acknowledgments

I am sincerely grateful to Serkan Girgin (University of Twente) for helping to sprout the idea for this exploration, the eScience Center for their continuous support with developing Tilbot, and the funded WeCare project with the Elisabeth-TweeSteden hospital and the Heracleum Fund for supporting our studies into the use of conversational agents in medical practice. Finally, I greatly appreciate the reviewers’ suggestions based on the initial version of this paper, and the valuable questions and suggestions from CONVERSATIONS workshop attendees.

Author information

Correspondence to Jan de Wit.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

de Wit, J. (2024). Leveraging Large Language Models as Simulated Users for Initial, Low-Cost Evaluations of Designed Conversations. In: Følstad, A., et al. Chatbot Research and Design. CONVERSATIONS 2023. Lecture Notes in Computer Science, vol 14524. Springer, Cham. https://doi.org/10.1007/978-3-031-54975-5_5

  • DOI: https://doi.org/10.1007/978-3-031-54975-5_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-54974-8

  • Online ISBN: 978-3-031-54975-5

  • eBook Packages: Computer Science (R0)
