DOI: 10.1145/3664647.3688991

A Method for Visual Spatial Description Based on Large Language Model Fine-tuning

Published: 28 October 2024

Abstract

In recent years, the task of image-to-text generation has received considerable attention. One of its subtasks, Visual Spatial Description (VSD), focuses on a model's ability to understand spatial relationships. VSD is a recently proposed task that emphasizes spatial semantics: given an image and two objects in it, a model must generate a sentence describing their spatial relationship. In this work, a VSD method based on large language model fine-tuning (LFVSD) is proposed to improve the accuracy and robustness of visual spatial relationship descriptions. Image and text features are first extracted with pre-trained models, and a Q-Former is employed for feature fusion; the original and fused features are then fed into FlanT5XXL. Object overlap priors are introduced, and momentum distillation is used to filter hard negative samples and generate soft labels. Finally, multiple VSD models are trained with data augmentation and long-tail data balancing techniques. The resulting approach is evaluated on the VSD2024 test set, which contains 5,855 images and their corresponding textual descriptions; the results demonstrate the effectiveness of the proposed method.
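To make the pipeline above more concrete, the following is a minimal sketch of the backbone the abstract implies: a BLIP-2-style model in which a Q-Former fuses visual features with the text prompt before a Flan-T5-XXL decoder generates the description. The sketch uses the public Hugging Face BLIP-2 checkpoint; the checkpoint id, prompt wording, and object pair are illustrative assumptions, and it is not the authors' LFVSD implementation (the fine-tuning, object overlap priors, momentum distillation, and long-tail balancing described above are not reproduced here).

```python
# A minimal sketch of the backbone implied by the abstract: a BLIP-2-style model
# whose Q-Former fuses visual features with the prompt before a Flan-T5-XXL
# decoder generates the description. NOT the authors' LFVSD system; the
# checkpoint id, prompt wording, and object pair are illustrative assumptions.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Public BLIP-2 checkpoint pairing a frozen vision encoder and Q-Former with Flan-T5-XXL.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16
).to(device)

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image

# VSD-style prompt naming the two objects whose spatial relationship is requested.
prompt = "Question: What is the spatial relationship between the dog and the sofa? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(generated_ids[0], skip_special_tokens=True))
```

In practice, fine-tuning such a backbone on (image, object pair, description) triples would correspond to the supervised stage the abstract describes, and momentum distillation is typically realized by keeping an exponential-moving-average copy of the model whose predictions serve as soft targets; the exact formulation used by the authors is given in the full paper.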


    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. feature fusion
    2. large language model fine-tuning
    3. momentum distillation
    4. overlap priors
    5. visual spatial description

    Qualifiers

    • Research-article

    Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne, VIC, Australia

    Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions (26%)
Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)
