Research Article
DOI: 10.1145/3474085.3475604

Triangle-Reward Reinforcement Learning: A Visual-Linguistic Semantic Alignment for Image Captioning

Published: 17 October 2021

ABSTRACT

Image captioning aims to generate a sentence of sequential linguistic words that describes the visual units (i.e., objects, relationships, and attributes) in a given image. Most existing methods rely on supervised learning with the cross-entropy (XE) objective to translate visual units into a sequence of linguistic words. However, we argue that the XE objective is not sensitive to visual-linguistic alignment: it can neither discriminately penalize semantic inconsistencies nor shrink the context gap. To solve these problems, we propose the Triangle-Reward Reinforcement Learning (TRRL) method. TRRL uses scene graphs (G), with objects as nodes and relationships as edges, to represent the image, the generated sentence, and the ground-truth sentence individually, and mutually aligns them during training. Specifically, TRRL formulates image captioning as two cooperative agents: the first agent extracts the visual scene graph (Gimg) from the image (I), and the second agent translates this graph into the sentence (S). To discriminately penalize visual-linguistic inconsistency, TRRL proposes a novel triangle-reward function: 1) the generated sentence and its corresponding ground truth are decomposed into the linguistic scene graph (Gsen) and the ground-truth scene graph (Ggt), respectively; 2) Gimg, Gsen, and Ggt are paired to calculate semantic similarity scores, which are proportionally assigned as rewards to each agent. Meanwhile, to make the training objective sensitive to context changes, we propose node-level and triplet-level scoring methods that jointly measure the visual-linguistic graph correlations. Extensive experiments on the MSCOCO dataset demonstrate the superiority of TRRL, and additional ablation studies further validate its effectiveness.
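
To make the triangle-reward idea concrete, below is a minimal Python sketch. It assumes an F1-style exact-match overlap as the node-level and triplet-level scores and a simple additive split of the pairwise graph similarities between the two agents; the names (SceneGraph, node_score, triplet_score, triangle_reward), the weight alpha, and the reward split are illustrative assumptions, not the paper's actual formulas.

```python
# Illustrative sketch of the triangle reward: pair G_img, G_sen, and G_gt,
# score their similarity at node level and triplet level, and split the
# resulting scores between the two agents. All names and formulas here are
# assumptions for illustration, not the paper's exact definitions.
from dataclasses import dataclass, field
from typing import Set, Tuple

Triplet = Tuple[str, str, str]  # (subject, relationship, object)


@dataclass
class SceneGraph:
    nodes: Set[str] = field(default_factory=set)          # objects / attributes
    triplets: Set[Triplet] = field(default_factory=set)   # relationship edges


def _f1_overlap(set_a: set, set_b: set) -> float:
    """F1 overlap between two sets (exact string matching, an assumption)."""
    if not set_a or not set_b:
        return 0.0
    common = len(set_a & set_b)
    if common == 0:
        return 0.0
    precision = common / len(set_a)
    recall = common / len(set_b)
    return 2 * precision * recall / (precision + recall)


def node_score(g_a: SceneGraph, g_b: SceneGraph) -> float:
    """Node-level score: overlap between the node sets of two graphs."""
    return _f1_overlap(g_a.nodes, g_b.nodes)


def triplet_score(g_a: SceneGraph, g_b: SceneGraph) -> float:
    """Triplet-level score: overlap between subject-relation-object triplets."""
    return _f1_overlap(g_a.triplets, g_b.triplets)


def graph_similarity(g_a: SceneGraph, g_b: SceneGraph, alpha: float = 0.5) -> float:
    """Jointly combine node-level and triplet-level scores (alpha is assumed)."""
    return alpha * node_score(g_a, g_b) + (1 - alpha) * triplet_score(g_a, g_b)


def triangle_reward(g_img: SceneGraph, g_sen: SceneGraph,
                    g_gt: SceneGraph) -> Tuple[float, float]:
    """Pair the three graphs and assign the similarity scores to the agents.

    Agent 1 (image -> G_img) is rewarded by how well G_img matches the other
    two graphs; agent 2 (G_img -> sentence) by how well G_sen matches them.
    The additive split below is an illustrative choice.
    """
    s_img_gt = graph_similarity(g_img, g_gt)    # visual graph vs. ground truth
    s_sen_gt = graph_similarity(g_sen, g_gt)    # generated sentence vs. ground truth
    s_img_sen = graph_similarity(g_img, g_sen)  # visual graph vs. generated sentence
    return s_img_gt + s_img_sen, s_sen_gt + s_img_sen


if __name__ == "__main__":
    g_img = SceneGraph({"man", "horse", "field"}, {("man", "riding", "horse")})
    g_sen = SceneGraph({"man", "horse"}, {("man", "riding", "horse")})
    g_gt = SceneGraph({"man", "horse", "field"},
                      {("man", "riding", "horse"), ("horse", "in", "field")})
    print(triangle_reward(g_img, g_sen, g_gt))
```

In the full method these scores would serve as reinforcement-learning rewards (e.g., under a policy-gradient objective), with Gsen and Ggt obtained by parsing the generated and ground-truth sentences into scene graphs; see the paper for the exact formulation.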

Supplemental Material

MM21-mfp2390.mp4 (mp4, 23.3 MB)


Published in

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021, 5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085
Copyright © 2021 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States
