Beyond Segmentation: Road Network Generation with Multi-modal LLMs

  • Conference paper

Intelligent Computing (SAI 2024)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 1016)

Abstract

This paper introduces an approach to road network generation based on a multi-modal large language model (LLM). Our model processes aerial images of road layouts and produces detailed, navigable road networks for the input images. The core innovation of our system lies in the training methodology that teaches the LLM to generate road networks directly as its output. The approach draws on the BLIP-2 architecture, combining a pre-trained frozen image encoder with a frozen LLM to form a versatile multi-modal model. Our work also offers an alternative to the reasoning-segmentation method proposed in the LISA paper: by training the LLM with our approach, the binary segmentation masks that LISA requires are no longer needed. Experimental results demonstrate that our multi-modal LLM provides precise and useful navigational guidance. This research is a step toward stronger autonomous navigation systems in road network scenarios, where accurate guidance is of paramount importance.
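As a rough illustration of the setup the abstract describes (a frozen image encoder, a small trainable bridge, and a language model trained to emit the road network as text rather than as a segmentation mask), the sketch below is a minimal, self-contained PyTorch mock-up. The module sizes, the toy encoder and decoder, and the line-string token format are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): a BLIP-2-style arrangement where a frozen
# image encoder feeds a trainable projection, whose "visual tokens" are prepended to
# the token embeddings of a language model trained to emit a road network serialized
# as text. All sizes and the serialization format are assumptions for illustration.
import torch
import torch.nn as nn

class RoadNetworkLLM(nn.Module):
    def __init__(self, vocab_size=512, d_model=256, n_visual_tokens=8):
        super().__init__()
        # Stand-in for a frozen pre-trained image encoder (a ViT in the BLIP-2 setting).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=8), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(32 * 16, d_model),
        )
        for p in self.image_encoder.parameters():
            p.requires_grad = False          # frozen, as in BLIP-2

        # Trainable bridge: maps image features to a handful of visual tokens.
        self.bridge = nn.Linear(d_model, d_model * n_visual_tokens)
        self.n_visual_tokens = n_visual_tokens

        # Stand-in for the language-model decoder over a road-network vocabulary.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, target_tokens):
        with torch.no_grad():
            feats = self.image_encoder(images)                      # (B, d_model)
        visual = self.bridge(feats).view(images.size(0), self.n_visual_tokens, -1)
        text = self.token_emb(target_tokens)                        # teacher-forced tokens
        hidden = self.decoder(torch.cat([visual, text], dim=1))
        return self.lm_head(hidden[:, self.n_visual_tokens:])       # logits for road-network tokens

# Training step: next-token cross-entropy on the serialized line strings,
# so no binary segmentation mask is ever produced.
model = RoadNetworkLLM()
images = torch.randn(2, 3, 64, 64)
tokens = torch.randint(0, 512, (2, 20))       # e.g. tokenized "(x1,y1)->(x2,y2); ..."
logits = model(images, tokens[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 512), tokens[:, 1:].reshape(-1))
loss.backward()
```

Because the supervision target is ordinary next-token cross-entropy over serialized line strings, no pixel-level mask head is needed, which is the contrast the paper draws with LISA's reasoning segmentation.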


References

  1. Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)

  2. Touvron, H., et al.: Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  3. Chiang, W.-L., et al.: Vicuna: an open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://vicuna.lmsys.org (2023)

  4. Lai, X., et al.: Lisa: reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692 (2023)

  5. Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)

  6. Wu, S., Fei, H., Qu, L., Ji, W., Chua, T.-S.: NExT-GPT: any-to-any multimodal LLM. arXiv preprint arXiv:2309.05519 (2023)

  7. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  8. Driess, D., et al.: Palm-e: an embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)

  9. Huang, S., et al.: Language is not all you need: aligning perception with language models. arXiv preprint arXiv:2302.14045 (2023)

  10. Li, J., Li, D., Xiong, C., Hoi, S.: Blip: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning (2022)

  11. Taori, R., et al.: Stanford alpaca: an instruction-following llama model (2023)

  12. Tsimpoukelli, M., Menick, J.L., Cabi, S., Eslami, S.M., Vinyals, O., Hill, F.: Multimodal few-shot learning with frozen language models. In: Advances in Neural Information Processing Systems (2021)

  13. Wei, J., et al.: Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022)

  14. Yang, A., Miech, A., Sivic, J., Laptev, I., Schmid, C.: Zero-shot video question answering via frozen bidirectional language models. In: Advances in Neural Information Processing Systems (2022)

  15. Yang, Z., et al.: Mm-react: prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381 (2023)

  16. Zhu, D., Chen, J., Haydarov, K., Shen, X., Zhang, W., Elhoseiny, M.: Chatgpt asks, blip-2 answers: automatic questioning towards enriched visual descriptions. arXiv preprint arXiv:2303.06594 (2023)

  17. Chen, X., et al.: Pali: a jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794 (2022)

  18. Cho, J., Lei, J., Tan, H., Bansal, M.: Unifying vision-and-language tasks via text generation. In: International Conference on Machine Learning (2021)

  19. Guo, J., et al.: From images to textual prompts: zero-shot vqa with frozen large language models. arXiv preprint arXiv:2212.10846 (2022)

  20. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)

  21. Fang, Y., et al.: Eva: exploring the limits of masked visual representation learning at scale. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

  22. Dong, L., et al.: Unified language model pre-training for natural language understanding and generation. In: Advances in Neural Information Processing Systems (2019)

  23. Dai, W., Hou, L., Shang, L., Jiang, X., Liu, Q., Fung, P.: Enabling multimodal generation on CLIP via vision-language knowledge distillation. arXiv preprint arXiv:2203.06386 (2022)

  24. Chung, H.W., et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)

  25. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (2021)

  26. Zhai, X., et al.: Lit: zero-shot transfer with locked-image text tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)

  27. Zhang, P., et al.: Vinvl: making visual representations matter in vision-language models. arXiv preprint arXiv:2101.00529 (2021)

  28. Zhang, S., et al.: Opt: open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022)

  29. Koh, J.Y., Fried, D., Salakhutdinov, R.: Generating images with multimodal language models. arXiv preprint arXiv:2305.17216 (2023)

  30. Touvron, H., et al.: Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  31. Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  32. OpenAI: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  33. OpenAI: Introducing chatgpt. https://openai.com/blog/chatgpt

Acknowledgments

We extend our gratitude to HERE North America LLC for generously providing the hardware necessary for model training and conducting our experiments. We also appreciate HERE for granting us access to their aerial imagery service and the road network line strings that were instrumental in NavGPT’s training.

Author information

Corresponding author

Correspondence to Sumedh Rasal.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Rasal, S., Boddhu, S.K. (2024). Beyond Segmentation: Road Network Generation with Multi-modal LLMs. In: Arai, K. (eds) Intelligent Computing. SAI 2024. Lecture Notes in Networks and Systems, vol 1016. Springer, Cham. https://doi.org/10.1007/978-3-031-62281-6_22
