DOI: 10.1145/3463945.3469054
Research article

Be Specific, Be Clear: Bridging Machine and Human Captions by Scene-Guided Transformer

Published: 27 August 2021

Abstract

Automatically generating natural language descriptions for images, i.e., image captioning, is one of the primary goals of multimedia understanding. The recent success of deep neural networks in image captioning has been accompanied by region-based bottom-up-attention features. Region-based features represent the contents of local regions but lack an overall understanding of the image, which is critical for more specific and clear language expression. Visual scene perception can facilitate this overall understanding and provide prior knowledge for generating specific and clear captions of objects, object relations, and the overall image scene. In this paper, we propose a Scene-Guided Transformer (SG-Transformer) model that leverages scene-level global context to generate more specific and descriptive image captions. SG-Transformer adopts an encoder-decoder architecture. The encoder aggregates the global scene context, as external knowledge, with object region-based features in attention learning to facilitate object relation reasoning, and it incorporates high-level auxiliary scene-guided tasks for more specific visual representation learning. The decoder then integrates both the object-level and scene-level information refined by the encoder for an overall image perception. Extensive experiments on the MSCOCO and Flickr30k benchmarks show the superiority and generality of SG-Transformer. Moreover, the proposed scene-guided approach enriches both object-level and scene-graph visual representations in the encoder and generalizes to both RNN- and Transformer-based decoders.
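As a concrete illustration of the encoder-decoder design sketched in the abstract, the PyTorch snippet below fuses bottom-up region features with a global scene-context vector by self-attention over a prepended scene token, attaches an auxiliary scene-classification head as a stand-in for the scene-guided tasks, and lets the decoder cross-attend to both the refined region features and the scene token. This is a minimal sketch under assumed dimensions and module names (SceneGuidedEncoderLayer, scene_head, a 365-way scene classifier), not the authors' implementation.

import torch
import torch.nn as nn


class SceneGuidedEncoderLayer(nn.Module):
    """Self-attention over region features augmented with a prepended scene token."""

    def __init__(self, d_model=512, n_heads=8, n_scene_classes=365):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Auxiliary scene-guided task (assumed here to be scene classification):
        # supervising this head encourages scene-aware representations.
        self.scene_head = nn.Linear(d_model, n_scene_classes)

    def forward(self, regions, scene_ctx):
        # regions:   (B, N, d_model) bottom-up region features
        # scene_ctx: (B, d_model)    global scene-context feature
        x = torch.cat([scene_ctx.unsqueeze(1), regions], dim=1)  # prepend scene token
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norm2(x + self.ffn(x))
        scene_tok, region_out = x[:, 0], x[:, 1:]
        return region_out, scene_tok, self.scene_head(scene_tok)


class CaptionDecoder(nn.Module):
    """Transformer decoder that cross-attends to object- and scene-level memory."""

    def __init__(self, vocab_size=10000, d_model=512, n_heads=8, n_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # positional encoding omitted for brevity
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, region_out, scene_tok):
        # The memory joins refined region features with the scene token, so the
        # decoder sees both local objects and the overall scene.
        memory = torch.cat([scene_tok.unsqueeze(1), region_out], dim=1)
        seq_len = tokens.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        return self.out(self.decoder(self.embed(tokens), memory, tgt_mask=causal))


# Toy usage: 36 region features per image plus one scene-context vector.
enc, dec = SceneGuidedEncoderLayer(), CaptionDecoder()
regions, scene_ctx = torch.randn(2, 36, 512), torch.randn(2, 512)
region_out, scene_tok, scene_logits = enc(regions, scene_ctx)
word_logits = dec(torch.randint(0, 10000, (2, 12)), region_out, scene_tok)
print(word_logits.shape)  # torch.Size([2, 12, 10000])

Prepending the scene context as an extra token is just one plausible way of injecting global context into region-level attention; the paper's actual fusion scheme and auxiliary objectives may differ.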

Supplementary Material

ZIP File (mmpt008aux.zip)
The supplementary file provides background on Transformer preliminaries, along with further details on the compared methods, results and analysis on the online COCO test server, human evaluation, and visualization analysis. Performance is reported on both the offline Karpathy test split and the online COCO test server.


Cited By

  • (2024) HDDA: Human-perception-centric Deepfake Detection Adapter. 2024 International Joint Conference on Neural Networks (IJCNN), 1-9. DOI: 10.1109/IJCNN60899.2024.10651342. Online publication date: 30 June 2024.
  • (2023) CYBORG: Blending Human Saliency Into the Loss Improves Deep Learning-Based Synthetic Face Detection. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 6097-6106. DOI: 10.1109/WACV56688.2023.00605. Online publication date: January 2023.
  • (2023) Model Focus Improves Performance of Deep Learning-Based Synthetic Face Detectors. IEEE Access, 11, 63430-63441. DOI: 10.1109/ACCESS.2023.3282927. Online publication date: 2023.


Published In

MMPT '21: Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding
August 2021
60 pages
ISBN:9781450385305
DOI:10.1145/3463945
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 August 2021


Author Tags

  1. context
  2. image captioning
  3. scene

Qualifiers

  • Research-article

Conference

ICMR '21


Article Metrics

  • Downloads (last 12 months): 10
  • Downloads (last 6 weeks): 1
Reflects downloads up to 25 Feb 2025

