DOI: 10.1145/3664647.3688991

A Method for Visual Spatial Description Based on Large Language Model Fine-tuning

Published: 28 October 2024

Abstract

In recent years, the task of image-to-text generation has received considerable attention. One of its subtasks, Visual Spatial Description (VSD), focuses on a model's ability to understand spatial relationships. VSD is a recently proposed task that emphasizes spatial semantics: given an image and two objects in it, a model must generate a sentence describing their spatial relationship. In this work, a VSD method based on large language model fine-tuning (LFVSD) is proposed to improve the accuracy and robustness of visual spatial relationship descriptions. Image and text features are first extracted with pre-trained models, and a Q-Former is employed for feature fusion; the original and fused features are then fed into FlanT5XXL. Object overlap priors are introduced, and momentum distillation is used to filter hard negative samples and generate soft labels. Finally, multiple VSD models are trained with data augmentation and long-tail data balancing techniques. The resulting approach is evaluated on the VSD2024 test set, which contains 5,855 images and their corresponding textual descriptions; the results demonstrate the effectiveness of the proposed method.
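To make the pipeline above more concrete, the following is a minimal sketch of the backbone the abstract implies: a BLIP-2-style model in which a Q-Former fuses visual features with the text prompt before a Flan-T5-XXL decoder generates the description. The sketch uses the public Hugging Face BLIP-2 checkpoint; the checkpoint id, prompt wording, and object pair are illustrative assumptions, and it is not the authors' LFVSD implementation (the fine-tuning, object overlap priors, momentum distillation, and long-tail balancing described above are not reproduced here).

```python
# A minimal sketch of the backbone implied by the abstract: a BLIP-2-style model
# whose Q-Former fuses visual features with the prompt before a Flan-T5-XXL
# decoder generates the description. NOT the authors' LFVSD system; the
# checkpoint id, prompt wording, and object pair are illustrative assumptions.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Public BLIP-2 checkpoint pairing a frozen vision encoder and Q-Former with Flan-T5-XXL.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16
).to(device)

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image

# VSD-style prompt naming the two objects whose spatial relationship is requested.
prompt = "Question: What is the spatial relationship between the dog and the sofa? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(generated_ids[0], skip_special_tokens=True))
```

In practice, fine-tuning such a backbone on (image, object pair, description) triples would correspond to the supervised stage the abstract describes, and momentum distillation is typically realized by keeping an exponential-moving-average copy of the model whose predictions serve as soft targets; the exact formulation used by the authors is given in the full paper.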


    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. feature fusion
    2. large language model fine-tuning
    3. momentum distillation
    4. overlap priors
    5. visual spatial description

    Qualifiers

    • Research-article

    Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne, VIC, Australia

    Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions (26%)
Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)
