ABSTRACT
Image captioning aims to generate a sentence of sequential linguistic words that describes the visual units (i.e., objects, relationships, and attributes) in a given image. Most existing methods rely on supervised learning with a cross-entropy (XE) objective to translate visual units into a sequence of linguistic words. However, we argue that the XE objective is insensitive to visual-linguistic alignment: it can neither discriminately penalize semantic inconsistency nor shrink the context gap. To address these problems, we propose the Triangle-Reward Reinforcement Learning (TRRL) method. TRRL uses the scene graph (G)---objects as nodes and relationships as edges---to represent the image, the generated sentence, and the ground-truth sentence individually, and mutually aligns them during training. Specifically, TRRL formulates image captioning as a task of cooperative agents, where the first agent extracts a visual scene graph (Gimg) from the image (I) and the second agent translates this graph into a sentence (S). To discriminately penalize visual-linguistic inconsistency, TRRL introduces a novel triangle-reward function: 1) the generated sentence and its corresponding ground truth are decomposed into a linguistic scene graph (Gsen) and a ground-truth scene graph (Ggt), respectively; 2) Gimg, Gsen, and Ggt are paired to compute semantic similarity scores, which are proportionally assigned as rewards to each agent. Meanwhile, to make the training objective sensitive to context changes, we propose node-level and triplet-level scoring methods that jointly measure visual-linguistic graph correlations. Extensive experiments on the MSCOCO dataset demonstrate the superiority of TRRL, and additional ablation studies further validate its effectiveness.
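The triangle-reward idea above can be illustrated with a small sketch: each scene graph is modeled as a set of object nodes plus a set of (subject, relation, object) triplets, pairwise similarities among Gimg, Gsen, and Ggt are computed with node-level and triplet-level scores, and the similarity mass is split between the two agents. All function names, the F1-style matching, and the reward split are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of triangle-reward scoring over scene graphs.
# A scene graph is a pair (nodes, triplets): a set of object labels and
# a set of (subject, relation, object) tuples.

def f1(overlap, n_pred, n_ref):
    """Harmonic mean of precision and recall over matched elements."""
    if n_pred == 0 or n_ref == 0:
        return 0.0
    p, r = overlap / n_pred, overlap / n_ref
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def graph_similarity(g1, g2, alpha=0.5):
    """Blend node-level and triplet-level F1 scores between two graphs."""
    nodes1, trips1 = g1
    nodes2, trips2 = g2
    node_score = f1(len(nodes1 & nodes2), len(nodes1), len(nodes2))
    trip_score = f1(len(trips1 & trips2), len(trips1), len(trips2))
    return alpha * node_score + (1 - alpha) * trip_score

def triangle_rewards(g_img, g_sen, g_gt):
    """Pair the three graphs and assign the resulting similarity scores
    to the graph-extraction agent and the sentence-generation agent."""
    s_img_gt = graph_similarity(g_img, g_gt)    # visual graph vs. ground truth
    s_sen_gt = graph_similarity(g_sen, g_gt)    # sentence graph vs. ground truth
    s_img_sen = graph_similarity(g_img, g_sen)  # visual vs. sentence graph
    reward_agent1 = s_img_gt + s_img_sen  # extraction agent
    reward_agent2 = s_sen_gt + s_img_sen  # translation agent
    return reward_agent1, reward_agent2

g_gt = ({"man", "horse"}, {("man", "riding", "horse")})
g_img = ({"man", "horse", "field"}, {("man", "riding", "horse")})
g_sen = ({"man", "horse"}, {("man", "on", "horse")})
r1, r2 = triangle_rewards(g_img, g_sen, g_gt)
```

Because the triplet-level score requires the full (subject, relation, object) tuple to match, it penalizes a wrong relation ("on" vs. "riding") even when the node-level score is perfect, which is the kind of context sensitivity the abstract attributes to joint node- and triplet-level scoring.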
Index Terms
- Triangle-Reward Reinforcement Learning: A Visual-Linguistic Semantic Alignment for Image Captioning