Abstract
The combination of visual recognition and language understanding aims to build a shared space between the heterogeneous data of vision and text, as in tasks such as image captioning and visual question answering (VQA). Most existing approaches convert an image into a semantic visual feature vector via deep convolutional neural networks (CNNs), while keeping the sequential property of text data and representing it with recurrent neural networks (RNNs). The key to analysing multi-source heterogeneous data is to construct the inherent correlations between the data. To reduce the heterogeneous gap between vision and language, in this work we represent the image sequentially, just like the text. We utilize the objects in the visual scene and convert the image into a sequence of detected objects and their locations, then analogize the sequence of objects (a visual language) to a sequence of words (natural language). We take the order of objects into account and evaluate different permutations and combinations of objects. Experimental results on image captioning and VQA benchmarks support our hypothesis that appropriately arranging the object sequence is beneficial for Vision-to-Language (V2L) problems.
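The core idea above, serializing detected objects into an ordered "visual sentence" whose ordering policy can be varied, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the `encode_objects` function and its two ordering policies (left-to-right by box centre, and largest-area-first) are illustrative assumptions standing in for the permutations evaluated in the paper.

```python
from typing import List, Tuple

# (label, (x1, y1, x2, y2)) -- a single detector output, e.g. from Faster R-CNN
Detection = Tuple[str, Tuple[float, float, float, float]]

def encode_objects(detections: List[Detection],
                   order: str = "left_to_right") -> List[str]:
    """Serialize detector output into an ordered sequence of object tokens,
    analogous to a sequence of words. The ordering policy is a hypothesis
    to be evaluated, not a fixed choice."""
    def centre_x(det: Detection) -> float:
        _, (x1, _, x2, _) = det
        return (x1 + x2) / 2.0

    def area(det: Detection) -> float:
        _, (x1, y1, x2, y2) = det
        return (x2 - x1) * (y2 - y1)

    if order == "left_to_right":
        ordered = sorted(detections, key=centre_x)   # spatial reading order
    elif order == "by_area":
        ordered = sorted(detections, key=area, reverse=True)  # salient first
    else:
        ordered = list(detections)                   # detector's native order
    return [label for label, _ in ordered]

# Hypothetical detections for one image
dets = [("dog", (100.0, 40.0, 260.0, 160.0)),
        ("person", (10.0, 20.0, 90.0, 180.0)),
        ("frisbee", (210.0, 60.0, 240.0, 90.0))]

print(encode_objects(dets))             # spatial order: person, dog, frisbee
print(encode_objects(dets, "by_area"))  # salience order: dog, person, frisbee
```

The resulting token sequence can then be fed to a sequence encoder (e.g. an LSTM) in the same way a sentence of words would be, which is what puts the two modalities into a common sequential form.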
Cite this article
Wang, J., Zhou, Y., Hu, Z. et al. Sequential image encoding for vision-to-language problems. Multimed Tools Appl 80, 16141–16152 (2021). https://doi.org/10.1007/s11042-019-08439-7