Abstract
This paper introduces an approach to road network generation based on a multi-modal large language model (LLM). The model takes aerial images of road layouts as input and produces detailed, navigable road networks for the imaged area. The core contribution is the training methodology that teaches the LLM to generate road networks directly as its output. The architecture draws inspiration from BLIP-2, combining a pre-trained frozen image encoder with a frozen LLM to build a versatile multi-modal model. Our approach also offers an alternative to the reasoning-segmentation method proposed in the LISA paper: because the LLM is trained to generate the road network itself, the binary segmentation masks that LISA relies on are no longer needed. Experimental results confirm that the multi-modal LLM provides precise and useful navigational guidance. This work is a step toward stronger autonomous navigation systems in road-network scenarios, where accurate guidance is paramount.
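As a rough illustration of the frozen-encoder-plus-frozen-LLM building block the abstract refers to, the sketch below runs an off-the-shelf BLIP-2 checkpoint from Hugging Face transformers on an aerial tile and prompts it for a textual road-network description. This is a minimal sketch, not the authors' NavGPT code: the checkpoint name, prompt wording, image path, and line-string-style output request are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load a public BLIP-2 checkpoint: a frozen ViT image encoder bridged to a
# frozen OPT language model by a small trained Q-Former (the design the
# abstract says this work builds on). Checkpoint choice is an assumption.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

# An aerial tile of the area whose road network we want described.
# "aerial_tile.png" is a placeholder path.
image = Image.open("aerial_tile.png").convert("RGB")
prompt = ("Question: describe the road network in this aerial image "
          "as line strings. Answer:")

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    "cuda", torch.float16
)
generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

The base checkpoint will only return a caption-like answer; the point of the sketch is the interface (an aerial image plus an instruction in, free-form text out), which the paper's training procedure repurposes so that the generated text is itself the navigable road network, removing the need for the binary segmentation masks produced in LISA.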
References
Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)
Touvron, H., et al.: Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
Chiang, W.-L., et al.: Vicuna: an open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://vicuna.lmsys.org (2023)
Lai, X., et al.: Lisa: reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692 (2023)
Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
Wu, S., Fei, H., Qu, L., Ji, W., Chua, T.-S.: NExT-GPT: any-to-any multimodal LLM. arXiv preprint arXiv:2309.05519 (2023)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Driess, D., et al.: Palm-e: an embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)
Huang, S., et al.: Language is not all you need: aligning perception with language models. arXiv preprint arXiv:2302.14045 (2023)
Li, J., Li, D., Xiong, C., Hoi, S.: Blip: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning (2022)
Taori, R., et al.: Stanford alpaca: an instruction-following llama model (2023)
Tsimpoukelli, M., Menick, J.L., Cabi, S., Eslami, S.M., Vinyals, O., Hill, F.: Multimodal few-shot learning with frozen language models. In: Advances in Neural Information Processing Systems (2021)
Wei, J., et al.: Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022)
Yang, A., Miech, A., Sivic, J., Laptev, I., Schmid, C.: Zero-shot video question answering via frozen bidirectional language models. In: Advances in Neural Information Processing Systems (2022)
Yang, Z., et al.: Mm-react: prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381 (2023)
Zhu, D., Chen, J., Haydarov, K., Shen, X., Zhang, W., Elhoseiny, M.: Chatgpt asks, blip-2 answers: automatic questioning towards enriched visual descriptions. arXiv preprint arXiv:2303.06594 (2023)
Chen, X., et al.: Pali: a jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794 (2023)
Cho, J., Lei, J., Tan, H., Bansal, M.: Unifying vision-and-language tasks via text generation. In: International Conference on Machine Learning (2021)
Guo, J., et al.: From images to textual prompts: zero-shot vqa with frozen large language models. arXiv preprint arXiv:2212.10846 (2022)
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
Fang, Y., et al.: Eva: exploring the limits of masked visual representation learning at scale. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)
Dong, L., et al.: Unified language model pre-training for natural language understanding and generation. In: Advances in Neural Information Processing Systems (2019)
Dai, W., Hou, L., Shang, L., Jiang, X., Liu, Q., Fung, P.: Enabling multimodal generation on CLIP via vision-language knowledge distillation. arXiv preprint arXiv:2203.06386 (2022)
Chung, H.W., et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2024)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (2021)
Zhai, X., et al.: Lit: zero-shot transfer with locked-image text tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)
Zhang, P., et al.: Vinvl: making visual representations matter in vision-language models. arXiv preprint arXiv:2101.00529 (2021)
Zhang, S., et al.: Opt: open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022)
Koh, J.Y., Fried, D., Salakhutdinov, R.: Generating images with multimodal language models. arXiv preprint arXiv:2305.17216 (2023)
Touvron, H., et al.: Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
OpenAI: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
OpenAI: Introducing chatgpt. https://openai.com/blog/chatgpt
Acknowledgments
We extend our gratitude to HERE North America LLC for generously providing the hardware necessary for model training and conducting our experiments. We also appreciate HERE for granting us access to their aerial imagery service and the road network line strings that were instrumental in NavGPT’s training.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Rasal, S., Boddhu, S.K. (2024). Beyond Segmentation: Road Network Generation with Multi-modal LLMs. In: Arai, K. (eds) Intelligent Computing. SAI 2024. Lecture Notes in Networks and Systems, vol 1016. Springer, Cham. https://doi.org/10.1007/978-3-031-62281-6_22
DOI: https://doi.org/10.1007/978-3-031-62281-6_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-62280-9
Online ISBN: 978-3-031-62281-6
eBook Packages: Intelligent Technologies and Robotics (R0)