ABSTRACT
Image captioning aims to generate a sentence of sequential linguistic words that describes the visual units (i.e., objects, relationships, and attributes) in a given image. Most existing methods rely on supervised learning with a cross-entropy (XE) objective to translate visual units into a sequence of linguistic words. However, we argue that the XE objective is insensitive to visual-linguistic alignment: it can neither discriminately penalize semantic inconsistency nor shrink the context gap. To address these problems, we propose the Triangle-Reward Reinforcement Learning (TRRL) method. TRRL uses the scene graph (G)---objects as nodes and relationships as edges---to represent the image, the generated sentence, and the ground-truth sentence individually, and mutually aligns them during training. Specifically, TRRL formulates image captioning as a task of cooperative agents, where the first agent extracts a visual scene graph (Gimg) from the image (I) and the second agent translates this graph into a sentence (S). To discriminately penalize visual-linguistic inconsistency, TRRL introduces a novel triangle-reward function: 1) the generated sentence and its corresponding ground truth are decomposed into a linguistic scene graph (Gsen) and a ground-truth scene graph (Ggt), respectively; 2) Gimg, Gsen, and Ggt are paired to compute semantic similarity scores, which are proportionally assigned as rewards to each agent. Meanwhile, to make the training objective sensitive to context changes, we propose node-level and triplet-level scoring methods that jointly measure visual-linguistic graph correlations. Extensive experiments on the MSCOCO dataset demonstrate the superiority of TRRL, and additional ablation studies further validate its effectiveness.
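The triangle-reward idea above can be illustrated with a small sketch: each scene graph is modeled as a set of object nodes plus a set of (subject, relation, object) triplets, pairwise similarities among Gimg, Gsen, and Ggt are computed with node-level and triplet-level scores, and the similarity mass is split between the two agents. All function names, the F1-style matching, and the reward split are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of triangle-reward scoring over scene graphs.
# A scene graph is a pair (nodes, triplets): a set of object labels and
# a set of (subject, relation, object) tuples.

def f1(overlap, n_pred, n_ref):
    """Harmonic mean of precision and recall over matched elements."""
    if n_pred == 0 or n_ref == 0:
        return 0.0
    p, r = overlap / n_pred, overlap / n_ref
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def graph_similarity(g1, g2, alpha=0.5):
    """Blend node-level and triplet-level F1 scores between two graphs."""
    nodes1, trips1 = g1
    nodes2, trips2 = g2
    node_score = f1(len(nodes1 & nodes2), len(nodes1), len(nodes2))
    trip_score = f1(len(trips1 & trips2), len(trips1), len(trips2))
    return alpha * node_score + (1 - alpha) * trip_score

def triangle_rewards(g_img, g_sen, g_gt):
    """Pair the three graphs and assign the resulting similarity scores
    to the graph-extraction agent and the sentence-generation agent."""
    s_img_gt = graph_similarity(g_img, g_gt)    # visual graph vs. ground truth
    s_sen_gt = graph_similarity(g_sen, g_gt)    # sentence graph vs. ground truth
    s_img_sen = graph_similarity(g_img, g_sen)  # visual vs. sentence graph
    reward_agent1 = s_img_gt + s_img_sen  # extraction agent
    reward_agent2 = s_sen_gt + s_img_sen  # translation agent
    return reward_agent1, reward_agent2

g_gt = ({"man", "horse"}, {("man", "riding", "horse")})
g_img = ({"man", "horse", "field"}, {("man", "riding", "horse")})
g_sen = ({"man", "horse"}, {("man", "on", "horse")})
r1, r2 = triangle_rewards(g_img, g_sen, g_gt)
```

Because the triplet-level score requires the full (subject, relation, object) tuple to match, it penalizes a wrong relation ("on" vs. "riding") even when the node-level score is perfect, which is the kind of context sensitivity the abstract attributes to joint node- and triplet-level scoring.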
Index Terms
- Triangle-Reward Reinforcement Learning: A Visual-Linguistic Semantic Alignment for Image Captioning