10.1145/3650105.3652296

Reality Bites: Assessing the Realism of Driving Scenarios with Large Language Models

Published: 12 June 2024


Large Language Models (LLMs) are demonstrating outstanding potential for tasks such as text generation, summarization, and classification. Given that such models are trained on a vast amount of online knowledge, we hypothesize that LLMs can assess whether driving scenarios generated by autonomous driving testing techniques are realistic, i.e., aligned with real-world driving conditions. To test this hypothesis, we conducted an empirical evaluation of whether LLMs are effective and robust in performing this task. This reality check is an important step towards devising LLM-based autonomous driving testing techniques. For our empirical evaluation, we selected 64 realistic scenarios from DeepScenario, an open driving scenario dataset. Next, by introducing minor changes to them, we created 512 additional realistic scenarios, forming an overall dataset of 576 scenarios. With this dataset, we evaluated the robustness of three LLMs (GPT-3.5, Llama2-13B, and Mistral-7B) in assessing the realism of driving scenarios. Our results show that: (1) overall, GPT-3.5 achieved the highest robustness compared to Llama2-13B and Mistral-7B, consistently across almost all scenarios, roads, and weather conditions; (2) Mistral-7B consistently performed the worst; (3) Llama2-13B achieved good results under certain conditions; and (4) roads and weather conditions do influence the robustness of the LLMs.
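
The dataset sizes quoted in the abstract can be checked with a few lines of Python. This is only an arithmetic sketch: the function name `dataset_summary` is hypothetical, and the per-original average assumes the 512 variants are spread evenly over the 64 originals, which the abstract does not state.

```python
# Dataset sizes as stated in the abstract.
ORIGINAL_SCENARIOS = 64
VARIANT_SCENARIOS = 512


def dataset_summary(originals: int, variants: int) -> dict:
    """Return the total dataset size and the average variants per original.

    The average is only meaningful under the (unconfirmed) assumption that
    variants are distributed evenly across the original scenarios.
    """
    return {
        "total": originals + variants,
        "variants_per_original": variants / originals,
    }


summary = dataset_summary(ORIGINAL_SCENARIOS, VARIANT_SCENARIOS)
print(summary)
```

The total matches the 576 scenarios reported in the abstract; the average of 8 variants per original is a derived figure, not one the abstract gives.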



Cited By

  • (2024) Foundation Models for the Digital Twins Creation of Cyber-Physical Systems. Leveraging Applications of Formal Methods, Verification and Validation. Application Areas. 10.1007/978-3-031-75390-9_2 (9-26). Online publication date: 27-Oct-2024





Published In

FORGE '24: Proceedings of the 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering
April 2024
140 pages



Association for Computing Machinery

New York, NY, United States


Author Tags

  1. large language models
  2. realistic driving scenarios
  3. robustness


  • Research-article



