DOI: 10.1145/3463945.3469054
Research article

Be Specific, Be Clear: Bridging Machine and Human Captions by Scene-Guided Transformer

Published: 27 August 2021

Abstract

Automatically generating natural language descriptions for images, i.e., image captioning, is one of the primary goals of multimedia understanding. The recent success of deep neural networks in image captioning has been accompanied by region-based bottom-up-attention features. Region-based features represent the contents of local regions but lack an overall understanding of the image, which is critical for more specific and clear language expression. Visual scene perception can facilitate this overall understanding and provide prior knowledge for generating specific and clear captions of objects, object relations, and the overall image scene. In this paper, we propose a Scene-Guided Transformer (SG-Transformer) model that leverages scene-level global context to generate more specific and descriptive image captions. SG-Transformer adopts an encoder-decoder architecture. The encoder aggregates the global scene context, as external knowledge, with object region-based features in attention learning to facilitate object relation reasoning, and it incorporates high-level auxiliary scene-guided tasks for more specific visual representation learning. The decoder then integrates both the object-level and scene-level information refined by the encoder for an overall image perception. Extensive experiments on the MSCOCO and Flickr30k benchmarks show the superiority and generality of SG-Transformer. Moreover, the proposed scene-guided approach enriches both object-level and scene-graph visual representations in the encoder and generalizes to both RNN- and Transformer-based decoders.
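As a concrete illustration of the encoder-decoder design sketched in the abstract, the PyTorch snippet below fuses bottom-up region features with a global scene-context vector by self-attention over a prepended scene token, attaches an auxiliary scene-classification head as a stand-in for the scene-guided tasks, and lets the decoder cross-attend to both the refined region features and the scene token. This is a minimal sketch under assumed dimensions and module names (SceneGuidedEncoderLayer, scene_head, a 365-way scene classifier), not the authors' implementation.

import torch
import torch.nn as nn


class SceneGuidedEncoderLayer(nn.Module):
    """Self-attention over region features augmented with a prepended scene token."""

    def __init__(self, d_model=512, n_heads=8, n_scene_classes=365):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Auxiliary scene-guided task (assumed here to be scene classification):
        # supervising this head encourages scene-aware representations.
        self.scene_head = nn.Linear(d_model, n_scene_classes)

    def forward(self, regions, scene_ctx):
        # regions:   (B, N, d_model) bottom-up region features
        # scene_ctx: (B, d_model)    global scene-context feature
        x = torch.cat([scene_ctx.unsqueeze(1), regions], dim=1)  # prepend scene token
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norm2(x + self.ffn(x))
        scene_tok, region_out = x[:, 0], x[:, 1:]
        return region_out, scene_tok, self.scene_head(scene_tok)


class CaptionDecoder(nn.Module):
    """Transformer decoder that cross-attends to object- and scene-level memory."""

    def __init__(self, vocab_size=10000, d_model=512, n_heads=8, n_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # positional encoding omitted for brevity
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, region_out, scene_tok):
        # The memory joins refined region features with the scene token, so the
        # decoder sees both local objects and the overall scene.
        memory = torch.cat([scene_tok.unsqueeze(1), region_out], dim=1)
        seq_len = tokens.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        return self.out(self.decoder(self.embed(tokens), memory, tgt_mask=causal))


# Toy usage: 36 region features per image plus one scene-context vector.
enc, dec = SceneGuidedEncoderLayer(), CaptionDecoder()
regions, scene_ctx = torch.randn(2, 36, 512), torch.randn(2, 512)
region_out, scene_tok, scene_logits = enc(regions, scene_ctx)
word_logits = dec(torch.randint(0, 10000, (2, 12)), region_out, scene_tok)
print(word_logits.shape)  # torch.Size([2, 12, 10000])

Prepending the scene context as an extra token is just one plausible way of injecting global context into region-level attention; the paper's actual fusion scheme and auxiliary objectives may differ.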

Supplementary Material

ZIP File (mmpt008aux.zip)
The supplementary file provides background on Transformer preliminaries, along with further details on the compared methods, results and analysis on the online COCO test server, human evaluation, and visualization analysis. Performance is reported on both the offline Karpathy test split and the online COCO test server.


Cited By

  • (2024) HDDA: Human-perception-centric Deepfake Detection Adapter. 2024 International Joint Conference on Neural Networks (IJCNN), 1-9. DOI: 10.1109/IJCNN60899.2024.10651342. Online publication date: 30 June 2024.
  • (2023) CYBORG: Blending Human Saliency Into the Loss Improves Deep Learning-Based Synthetic Face Detection. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 6097-6106. DOI: 10.1109/WACV56688.2023.00605. Online publication date: January 2023.
  • (2023) Model Focus Improves Performance of Deep Learning-Based Synthetic Face Detectors. IEEE Access, 11, 63430-63441. DOI: 10.1109/ACCESS.2023.3282927. Online publication date: 2023.


Published In

MMPT '21: Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding
August 2021
60 pages
ISBN:9781450385305
DOI:10.1145/3463945
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 August 2021


Author Tags

  1. context
  2. image captioning
  3. scene

Qualifiers

  • Research-article

Conference

ICMR '21


Article Metrics

  • Downloads (last 12 months): 10
  • Downloads (last 6 weeks): 1
Reflects downloads up to 25 Feb 2025

