
Making Large Language Models Better Planners with Reasoning-Decision Alignment

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15094)


Abstract

Data-driven approaches for autonomous driving (AD) have been widely adopted in the past decade but are confronted with dataset bias and a lack of interpretability. Inspired by the knowledge-driven nature of human driving, recent approaches explore the potential of large language models (LLMs) to improve understanding and decision-making in traffic scenarios. They find that the pretrain-finetune paradigm of LLMs on downstream data with the Chain-of-Thought (CoT) reasoning process can enhance explainability and scene understanding. However, this popular strategy suffers from a notorious misalignment problem: the crafted CoTs can contradict the consequent decision-making, an issue left unaddressed by previous LLM-based AD methods. To address this problem, we propose an end-to-end decision-making model based on a multimodality-augmented LLM that simultaneously performs CoT reasoning and produces planning results. Furthermore, we introduce a reasoning-decision alignment constraint between the paired CoTs and planning results, enforcing the correspondence between reasoning and decision-making. Moreover, we redesign the CoTs so that the model can comprehend complex scenarios, further improving decision-making performance. We dub our proposed large language planner with reasoning-decision alignment RDA-Driver. Experimental evaluations on the nuScenes and DriveLM-nuScenes benchmarks demonstrate the effectiveness of RDA-Driver in enhancing the performance of end-to-end AD systems. Specifically, RDA-Driver achieves state-of-the-art planning performance on the nuScenes dataset with a 0.80 m average L2 error and a 0.32% collision rate, and also achieves leading results on the challenging DriveLM-nuScenes benchmark with a 0.82 m average L2 error and a 0.38% collision rate.

Z. Huang and T. Tang—Equal contribution.
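To make the alignment constraint concrete, the following is a minimal sketch of one way a reasoning-decision alignment objective could be implemented as a pairwise ranking loss over candidate CoT-plan sequences, in the spirit of RRHF-style ranking objectives. The function name, the use of length-normalized sequence log-probabilities, and the hinge form are illustrative assumptions, not the authors' exact formulation.

```python
import torch

def alignment_ranking_loss(seq_logprobs: torch.Tensor,
                           plan_quality: torch.Tensor) -> torch.Tensor:
    """Hypothetical reasoning-decision alignment loss (illustrative only).

    seq_logprobs: (K,) length-normalized log-probabilities the LLM assigns
        to K candidate CoT+plan sequences for the same scene.
    plan_quality: (K,) scalar quality of each candidate's plan, higher is
        better (e.g., negative L2 distance to the ground-truth trajectory).

    Every pair in which a worse plan out-scores a better one is penalized,
    pushing the model to rank consistent reasoning-decision pairs highest.
    """
    loss = seq_logprobs.new_zeros(())
    num_candidates = seq_logprobs.shape[0]
    for i in range(num_candidates):
        for j in range(num_candidates):
            if plan_quality[i] > plan_quality[j]:
                # Hinge on the log-prob gap: the better candidate should
                # receive at least as high a sequence score.
                loss = loss + torch.relu(seq_logprobs[j] - seq_logprobs[i])
    return loss
```

In such a setup, the ranking term would be added to the usual next-token cross-entropy on the ground-truth CoT and plan, so the model learns both to generate the reasoning and to score faithful reasoning-decision pairs above inconsistent ones.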
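For context, the L2 error and collision rate quoted above are the standard open-loop planning metrics on nuScenes. The sketch below shows how such metrics are commonly computed; the array shapes, the grid-based collision test, and all names are illustrative assumptions rather than the benchmark's exact implementation.

```python
import numpy as np

def average_l2_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (T, 2) planned vs. ground-truth ego waypoints in meters.
    Returns the mean L2 distance over the planning horizon."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def collides(pred: np.ndarray, occupancy: np.ndarray,
             resolution: float, origin: np.ndarray) -> bool:
    """occupancy: (H, W) boolean BEV grid marking cells occupied by other
    agents; origin: (2,) world coordinates of grid cell (0, 0).
    Flags a collision if any waypoint lands in an occupied cell (a
    simplification of the ego-box overlap test used in practice)."""
    idx = np.floor((pred - origin) / resolution).astype(int)
    rows, cols = idx[:, 0], idx[:, 1]
    h, w = occupancy.shape
    valid = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    return bool(occupancy[rows[valid], cols[valid]].any())
```

A reported collision rate is then just the fraction of evaluated trajectories for which a check like `collides` fires at any horizon step.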



Acknowledgements

This work was supported in part by National Science and Technology Major Project (2020AAA0109704), National Science and Technology Ministry Youth Talent Funding No. 2022WRQB002, Guangdong Outstanding Youth Fund (Grant No. 2021B1515020061), Mobility Grant Award under Grant No. M-0461, Shenzhen Science and Technology Program (Grant No. GJHZ20220913142600001), Nansha Key R&D Program under Grant No. 2022ZD014.

Author information


Correspondence to Xiaodan Liang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 651 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Huang, Z. et al. (2025). Making Large Language Models Better Planners with Reasoning-Decision Alignment. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15094. Springer, Cham. https://doi.org/10.1007/978-3-031-72764-1_5


  • DOI: https://doi.org/10.1007/978-3-031-72764-1_5


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72763-4

  • Online ISBN: 978-3-031-72764-1

  • eBook Packages: Computer Science, Computer Science (R0)
