Abstract
This paper introduces an approach to road network generation based on a multi-modal large language model (LLM). The model takes aerial images of road layouts as input and produces detailed, navigable road networks for the imaged area. The core contribution is the training methodology that teaches the LLM to generate road networks directly as its output. The architecture draws inspiration from BLIP-2, combining a pre-trained frozen image encoder with a frozen LLM to build a versatile multi-modal model. Our approach also offers an alternative to the reasoning-segmentation method proposed in the LISA paper: because the LLM is trained to generate the road network itself, the binary segmentation masks that LISA relies on are no longer needed. Experimental results confirm that the multi-modal LLM provides precise and useful navigational guidance. This work is a step toward stronger autonomous navigation systems in road-network scenarios, where accurate guidance is paramount.
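As a rough illustration of the frozen-encoder-plus-frozen-LLM building block the abstract refers to, the sketch below runs an off-the-shelf BLIP-2 checkpoint from Hugging Face transformers on an aerial tile and prompts it for a textual road-network description. This is a minimal sketch, not the authors' NavGPT code: the checkpoint name, prompt wording, image path, and line-string-style output request are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load a public BLIP-2 checkpoint: a frozen ViT image encoder bridged to a
# frozen OPT language model by a small trained Q-Former (the design the
# abstract says this work builds on). Checkpoint choice is an assumption.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

# An aerial tile of the area whose road network we want described.
# "aerial_tile.png" is a placeholder path.
image = Image.open("aerial_tile.png").convert("RGB")
prompt = ("Question: describe the road network in this aerial image "
          "as line strings. Answer:")

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    "cuda", torch.float16
)
generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

The base checkpoint will only return a caption-like answer; the point of the sketch is the interface (an aerial image plus an instruction in, free-form text out), which the paper's training procedure repurposes so that the generated text is itself the navigable road network, removing the need for the binary segmentation masks produced in LISA.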
References
Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)
Touvron, H., et al.: Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
Chiang, W.-L., et al.: Vicuna: an open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://vicuna.lmsys.org (2023)
Lai, X., et al.: Lisa: reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692 (2023)
Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
Wu, S., Fei, H., Qu, L., Ji, W., Chua, T.-S.: NExT-GPT: any-to-any multimodal LLM. arXiv preprint arXiv:2309.05519 (2023)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Driess, D., et al.: Palm-e: an embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)
Huang, S., et al.: Language is not all you need: aligning perception with language models. arXiv preprint arXiv:2302.14045 (2023)
Li, J., Li, D., Xiong, C., Hoi, S.: Blip: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning (2022)
Taori, R., et al.: Stanford alpaca: an instruction-following llama model (2023)
Tsimpoukelli, M., Menick, J.L., Cabi, S., Eslami, S.M., Vinyals, O., Hill, F.: Multimodal few-shot learning with frozen language models. In: Advances in Neural Information Processing Systems (2021)
Wei, J., et al.: Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022)
Yang, A., Miech, A., Sivic, J., Laptev, I., Schmid, C.: Zero-shot video question answering via frozen bidirectional language models. In: Advances in Neural Information Processing Systems (2022)
Yang, Z., et al.: Mm-react: prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381 (2023)
Zhu, D., Chen, J., Haydarov, K., Shen, X., Zhang, W., Elhoseiny, M.: Chatgpt asks, blip-2 answers: automatic questioning towards enriched visual descriptions. arXiv preprint arXiv:2303.06594 (2023)
Chen, X., et al.: Pali: a jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794 (2023)
Cho, J., Lei, J., Tan, H., Bansal, M.: Unifying vision-and-language tasks via text generation. In: International Conference on Machine Learning (2021)
Guo, J., et al.: From images to textual prompts: zero-shot vqa with frozen large language models. arXiv preprint arXiv:2212.10846 (2022)
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
Fang, Y., et al.: Eva: exploring the limits of masked visual representation learning at scale. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)
Dong, L., et al.: Unified language model pre-training for natural language understanding and generation. In: Advances in Neural Information Processing Systems (2019)
Dai, W., Hou, L., Shang, L., Jiang, X., Liu, Q., Fung, P.: Enabling multimodal generation on CLIP via vision-language knowledge distillation. arXiv preprint arXiv:2203.06386 (2022)
Chung, H.W., et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2024)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (2021)
Zhai, X., et al.: Lit: zero-shot transfer with locked-image text tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)
Zhang, P., et al.: Vinvl: making visual representations matter in vision-language models. arXiv preprint arXiv:2101.00529 (2021)
Zhang, S., et al.: Opt: open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022)
Koh, J.Y., Fried, D., Salakhutdinov, R.: Generating images with multimodal language models. arXiv preprint arXiv:2305.17216 (2023)
Touvron, H., et al.: Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
OpenAI: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
OpenAI: Introducing chatgpt. https://openai.com/blog/chatgpt
Acknowledgments
We extend our gratitude to HERE North America LLC for generously providing the hardware necessary for model training and conducting our experiments. We also appreciate HERE for granting us access to their aerial imagery service and the road network line strings that were instrumental in NavGPT’s training.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Rasal, S., Boddhu, S.K. (2024). Beyond Segmentation: Road Network Generation with Multi-modal LLMs. In: Arai, K. (eds) Intelligent Computing. SAI 2024. Lecture Notes in Networks and Systems, vol 1016. Springer, Cham. https://doi.org/10.1007/978-3-031-62281-6_22
DOI: https://doi.org/10.1007/978-3-031-62281-6_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-62280-9
Online ISBN: 978-3-031-62281-6
eBook Packages: Intelligent Technologies and Robotics (R0)