DOI: 10.1145/3394171.3414004
Research article

Bridging the Gap between Vision and Language Domains for Improved Image Captioning

Published: 12 October 2020

Abstract

Image captioning has attracted extensive research interest in recent years. Owing to the great disparity between vision and language, an important goal of image captioning is to link information in the visual domain to the textual domain. However, many approaches perform this linking only in the decoder, which makes it hard to understand images and generate captions effectively. In this paper, we propose to bridge the gap between the vision and language domains in the encoder, by enriching visual information with textual concepts, to achieve deep image understanding. To this end, we explore textual-enriched image features. Specifically, we introduce two modules, namely the Textual Distilling Module and the Textual Association Module. The former distills relevant textual concepts from image features, while the latter further associates the extracted concepts according to their semantics. In this manner, we acquire textual-enriched image features, which provide clear textual representations of an image without explicit supervision. The proposed approach can be used as a plugin and easily embedded into a wide range of existing image captioning systems. We conduct extensive experiments on two benchmark image captioning datasets, MSCOCO and Flickr30k. The experimental results and analysis show that, with the proposed approach, all baseline models receive consistent improvements across all metrics, with the most significant improvements of up to 10% and 9% on the task-specific metrics CIDEr and SPICE, respectively. The results demonstrate that our approach is effective and generalizes well to a wide range of image captioning models.
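
The abstract describes the two modules only at a high level, so the following minimal PyTorch sketch illustrates one plausible way such an encoder-side plugin could be wired; it is not the authors' implementation. The concept vocabulary, the cross-attention distilling step, the self-attention association step, and the fusion layer are all assumptions made here for illustration.

```python
import torch
import torch.nn as nn


class TextualDistillingModule(nn.Module):
    """Hypothetical sketch: distill textual concepts from region features
    via attention over a learned concept vocabulary (an assumption here)."""

    def __init__(self, feat_dim: int, num_concepts: int = 1000, concept_dim: int = 512):
        super().__init__()
        # Concept embeddings stand in for a textual-concept vocabulary,
        # e.g. frequent caption words; purely illustrative.
        self.concept_embed = nn.Embedding(num_concepts, concept_dim)
        self.query_proj = nn.Linear(feat_dim, concept_dim)

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (batch, num_regions, feat_dim)
        queries = self.query_proj(region_feats)                 # (B, R, C)
        concepts = self.concept_embed.weight                    # (V, C)
        scores = queries @ concepts.t() / concepts.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)                    # (B, R, V)
        # Each region becomes a mixture of concept embeddings.
        return attn @ concepts                                   # (B, R, C)


class TextualAssociationModule(nn.Module):
    """Hypothetical sketch: associate the distilled concepts by their
    semantics with a single self-attention layer."""

    def __init__(self, concept_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(concept_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(concept_dim)

    def forward(self, concept_feats: torch.Tensor) -> torch.Tensor:
        attended, _ = self.self_attn(concept_feats, concept_feats, concept_feats)
        return self.norm(concept_feats + attended)


class TextualEnrichedEncoder(nn.Module):
    """Plugin-style wrapper: fuse the original visual features with the
    textual-enriched features before handing them to any caption decoder."""

    def __init__(self, feat_dim: int = 2048, concept_dim: int = 512):
        super().__init__()
        self.distill = TextualDistillingModule(feat_dim, concept_dim=concept_dim)
        self.associate = TextualAssociationModule(concept_dim)
        self.fuse = nn.Linear(feat_dim + concept_dim, feat_dim)

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        textual = self.associate(self.distill(region_feats))
        return self.fuse(torch.cat([region_feats, textual], dim=-1))


if __name__ == "__main__":
    feats = torch.randn(2, 36, 2048)  # e.g. 36 bottom-up region features
    enriched = TextualEnrichedEncoder()(feats)
    print(enriched.shape)  # torch.Size([2, 36, 2048])
```

In this sketch, an existing captioning decoder would simply consume the enriched region features in place of the raw ones, mirroring the plugin usage described in the abstract.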

Supplementary Material

MP4 File (3394171.3414004.mp4)
In this paper, we focus on bridging the gap between the vision and language domains by enriching image features with textual concepts, which provides a solid basis for describing images. In particular, we explore textual representations of image features to describe salient image regions at the textual level. Our proposed solution consistently improves the performance of all the strong baselines across all metrics.

Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 October 2020

Author Tags

  1. attention mechanism
  2. image captioning
  3. image representations
  4. textual concepts

Qualifiers

  • Research-article

Conference

MM '20

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 29
  • Downloads (Last 6 weeks): 4
Reflects downloads up to 17 Jan 2025

Citations

Cited By

  • (2024) Cascade Semantic Prompt Alignment Network for Image Captioning. IEEE Transactions on Circuits and Systems for Video Technology, 34(7), 5266-5281. DOI: 10.1109/TCSVT.2023.3343520. Online publication date: Jul-2024.
  • (2024) ETransCap: efficient transformer for image captioning. Applied Intelligence, 54(21), 10748-10762. DOI: 10.1007/s10489-024-05739-w. Online publication date: 1-Nov-2024.
  • (2023) Unpaired Image Captioning by Image-Level Weakly-Supervised Visual Concept Recognition. IEEE Transactions on Multimedia, 25, 6702-6716. DOI: 10.1109/TMM.2022.3214090. Online publication date: 1-Jan-2023.
  • (2023) Knowing What it is: Semantic-Enhanced Dual Attention Transformer. IEEE Transactions on Multimedia, 25, 3723-3736. DOI: 10.1109/TMM.2022.3164787. Online publication date: 1-Jan-2023.
  • (2023) Semantic-Guided Selective Representation for Image Captioning. IEEE Access, 11, 14500-14510. DOI: 10.1109/ACCESS.2023.3243952. Online publication date: 2023.
  • (2022) Multilevel Attention Networks and Policy Reinforcement Learning for Image Caption Generation. Big Data, 10(6), 481-492. DOI: 10.1089/big.2021.0049. Online publication date: 1-Dec-2022.
  • (2022) Dual Global Enhanced Transformer for image captioning. Neural Networks, 148, 129-141. DOI: 10.1016/j.neunet.2022.01.011. Online publication date: Apr-2022.
  • (2021) DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention. ACM Transactions on Knowledge Discovery from Data, 16(1), 1-19. DOI: 10.1145/3447685. Online publication date: 20-Jul-2021.
