DOI: 10.1145/3240508.3240695

Paragraph Generation Network with Visual Relationship Detection

Published: 15 October 2018

ABSTRACT

Image paragraph generation is a recently introduced task that aims to produce multiple sentences describing a given image. In this paper, we propose a paragraph generation network that incorporates visual relationship detection. We first detect regions that may contain important visual objects and then predict the relationships between them. Paragraphs are generated from the object regions that have valid relationships with others. Compared with previous works, which generate sentences from region features alone, we explicitly explore and utilize visual relationships to improve the final captions. The experimental results show that this strategy improves paragraph generation in two respects: more details about object relations are captured, and more accurate sentences are obtained. Furthermore, our model is more robust to fluctuations in region detection.
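The abstract describes a three-stage pipeline: region detection, pairwise relationship prediction, and sentence generation over the regions that participate in valid relationships. The sketch below illustrates that data flow only; every name here (Region, detect_regions, predict_relationship, generate_paragraph), the dummy detections, the lookup table, and the threshold are illustrative assumptions, not the paper's actual components, which would be learned networks (e.g., a Faster R-CNN-style detector and an RNN decoder).

```python
from dataclasses import dataclass
from itertools import permutations

@dataclass
class Region:
    box: tuple    # (x1, y1, x2, y2) in image coordinates
    label: str    # detected object class
    score: float  # detection confidence

def detect_regions(image):
    """Stage 1: region proposals. Dummy output standing in for a
    real detector's predictions."""
    return [
        Region((10, 10, 80, 120), "person", 0.93),
        Region((60, 90, 200, 180), "bicycle", 0.88),
        Region((0, 150, 240, 200), "road", 0.71),
    ]

def predict_relationship(subj, obj):
    """Stage 2: score a (subject, predicate, object) triple.
    A real model would classify features of the two regions and
    their union box; a lookup table stands in for that here."""
    table = {("person", "bicycle"): ("riding", 0.9),
             ("bicycle", "road"): ("on", 0.8)}
    return table.get((subj.label, obj.label), (None, 0.0))

def generate_paragraph(image, rel_threshold=0.5):
    """Stage 3: keep only relationships above threshold, so regions
    with no valid relation contribute no sentence, then decode one
    sentence per retained triple (a template replaces the decoder)."""
    regions = detect_regions(image)
    triples = []
    for subj, obj in permutations(regions, 2):
        pred, conf = predict_relationship(subj, obj)
        if pred is not None and conf >= rel_threshold:
            triples.append((subj, pred, obj))
    sentences = [f"A {s.label} is {p} a {o.label}." for s, p, o in triples]
    return " ".join(sentences)

print(generate_paragraph(image=None))
# -> "A person is riding a bicycle. A bicycle is on a road."
```

Note how the filtering step realizes the claim in the abstract: the "road" region only enters the paragraph because it holds a valid relationship with "bicycle", so spurious detections without relations are suppressed.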



Published in

          MM '18: Proceedings of the 26th ACM international conference on Multimedia
          October 2018
          2167 pages
ISBN: 9781450356657
DOI: 10.1145/3240508

          Copyright © 2018 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States



          Qualifiers

          • research-article

          Acceptance Rates

MM '18 Paper Acceptance Rate: 209 of 757 submissions, 28%. Overall Acceptance Rate: 995 of 4,171 submissions, 24%.

