
Making Large Language Models Better Planners with Reasoning-Decision Alignment

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15094)


Abstract

Data-driven approaches for autonomous driving (AD) have been widely adopted in the past decade but are confronted with dataset bias and a lack of interpretability. Inspired by the knowledge-driven nature of human driving, recent approaches explore the potential of large language models (LLMs) to improve understanding and decision-making in traffic scenarios. They find that the pretrain-finetune paradigm of LLMs on downstream data with the Chain-of-Thought (CoT) reasoning process can enhance explainability and scene understanding. However, this popular strategy suffers from a notorious misalignment problem: the crafted CoTs can contradict the consequent decision-making, an issue left unaddressed by previous LLM-based AD methods. To address this problem, we propose an end-to-end decision-making model based on a multimodality-augmented LLM that simultaneously performs CoT reasoning and produces planning results. Furthermore, we introduce a reasoning-decision alignment constraint between the paired CoTs and planning results, enforcing the correspondence between reasoning and decision-making. Moreover, we redesign the CoTs so that the model can comprehend complex scenarios, further improving decision-making performance. We dub our proposed large language planner with reasoning-decision alignment RDA-Driver. Experimental evaluations on the nuScenes and DriveLM-nuScenes benchmarks demonstrate the effectiveness of RDA-Driver in enhancing the performance of end-to-end AD systems. Specifically, RDA-Driver achieves state-of-the-art planning performance on the nuScenes dataset with a 0.80 m average L2 error and a 0.32% collision rate, and also achieves leading results on the challenging DriveLM-nuScenes benchmark with a 0.82 m average L2 error and a 0.38% collision rate.

Z. Huang and T. Tang—Equal contribution.
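To make the alignment constraint concrete, the following is a minimal sketch of one way a reasoning-decision alignment objective could be implemented as a pairwise ranking loss over candidate CoT-plan sequences, in the spirit of RRHF-style ranking objectives. The function name, the use of length-normalized sequence log-probabilities, and the hinge form are illustrative assumptions, not the authors' exact formulation.

```python
import torch

def alignment_ranking_loss(seq_logprobs: torch.Tensor,
                           plan_quality: torch.Tensor) -> torch.Tensor:
    """Hypothetical reasoning-decision alignment loss (illustrative only).

    seq_logprobs: (K,) length-normalized log-probabilities the LLM assigns
        to K candidate CoT+plan sequences for the same scene.
    plan_quality: (K,) scalar quality of each candidate's plan, higher is
        better (e.g., negative L2 distance to the ground-truth trajectory).

    Every pair in which a worse plan out-scores a better one is penalized,
    pushing the model to rank consistent reasoning-decision pairs highest.
    """
    loss = seq_logprobs.new_zeros(())
    num_candidates = seq_logprobs.shape[0]
    for i in range(num_candidates):
        for j in range(num_candidates):
            if plan_quality[i] > plan_quality[j]:
                # Hinge on the log-prob gap: the better candidate should
                # receive at least as high a sequence score.
                loss = loss + torch.relu(seq_logprobs[j] - seq_logprobs[i])
    return loss
```

In such a setup, the ranking term would be added to the usual next-token cross-entropy on the ground-truth CoT and plan, so the model learns both to generate the reasoning and to score faithful reasoning-decision pairs above inconsistent ones.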
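For context, the L2 error and collision rate quoted above are the standard open-loop planning metrics on nuScenes. The sketch below shows how such metrics are commonly computed; the array shapes, the grid-based collision test, and all names are illustrative assumptions rather than the benchmark's exact implementation.

```python
import numpy as np

def average_l2_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (T, 2) planned vs. ground-truth ego waypoints in meters.
    Returns the mean L2 distance over the planning horizon."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def collides(pred: np.ndarray, occupancy: np.ndarray,
             resolution: float, origin: np.ndarray) -> bool:
    """occupancy: (H, W) boolean BEV grid marking cells occupied by other
    agents; origin: (2,) world coordinates of grid cell (0, 0).
    Flags a collision if any waypoint lands in an occupied cell (a
    simplification of the ego-box overlap test used in practice)."""
    idx = np.floor((pred - origin) / resolution).astype(int)
    rows, cols = idx[:, 0], idx[:, 1]
    h, w = occupancy.shape
    valid = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    return bool(occupancy[rows[valid], cols[valid]].any())
```

A reported collision rate is then just the fraction of evaluated trajectories for which a check like `collides` fires at any horizon step.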



Acknowledgements

This work was supported in part by National Science and Technology Major Project (2020AAA0109704), National Science and Technology Ministry Youth Talent Funding No. 2022WRQB002, Guangdong Outstanding Youth Fund (Grant No. 2021B1515020061), Mobility Grant Award under Grant No. M-0461, Shenzhen Science and Technology Program (Grant No. GJHZ20220913142600001), Nansha Key R&D Program under Grant No. 2022ZD014.

Author information


Correspondence to Xiaodan Liang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 651 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Huang, Z. et al. (2025). Making Large Language Models Better Planners with Reasoning-Decision Alignment. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15094. Springer, Cham. https://doi.org/10.1007/978-3-031-72764-1_5


  • DOI: https://doi.org/10.1007/978-3-031-72764-1_5


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72763-4

  • Online ISBN: 978-3-031-72764-1

  • eBook Packages: Computer Science, Computer Science (R0)
