Pattern Recognition Letters

Volume 145, May 2021, Pages 187-193

Scene-Graph-Guided message passing network for dense captioning

https://doi.org/10.1016/j.patrec.2021.01.024

Highlights

  • We propose to leverage the rich visual concepts and the structured knowledge for dense caption generation.

  • We use the objective function of scene graph generation to propagate the structured knowledge through the refining pipeline.

  • Experimental results and qualitative experiments confirm the effect of our model.

Abstract

The dense captioning task aims to both localize and describe salient regions in images with natural language. It can benefit from rich visual concepts such as objects and their pairwise relationships. However, due to the challenging combinatorial complexity of formulating <subject-predicate-object> triplets, very little work has integrated them into dense captioning. Inspired by the recent success of scene graph generation for object and relationship detection, we propose a scene-graph-guided message passing network for dense caption generation. We first exploit message passing between objects and their relationships with a feature refining structure. Moreover, we formulate the message passing as an inter-connected visual concept generation problem in which the objective function of scene graph generation guides the region feature learning. The scene graph guidance propagates the structured knowledge of the graph through a concept-region message passing mechanism (CR-MPM), which improves the regional feature representation. Finally, the refined regional features are encoded by an LSTM-based decoder to generate dense captions. Our model achieves competitive performance on Visual Genome compared with existing methods, and qualitative experiments further confirm its effectiveness for dense captioning.

Introduction

Recently, image captioning [1] has received much attention in the field of computer vision. A number of existing approaches [2], [3], [4], [5], [6] have achieved remarkable success on popular datasets such as MSCOCO [7] and Flickr30k [8]. Further, due to the challenges of complete visual understanding from holistic images, [9] proposes the task of dense captioning, where dense descriptions of image regions provide a better interpretation of the visual content. In particular, the generated captions provide more fine-grained semantic cues for image regions, which further enables complex reasoning about visual context. Hence, dense captioning can be used in visual question answering [10], [11], since that task requires more visual details from question-relevant regions to answer open-ended questions [12], [13].

Intuitively, image data consists of diverse visual concepts at different semantic levels, such as objects and the relationships between them. These visual concepts are highly correlated with the task of dense captioning. In particular, they provide rich semantic cues and support complex scene understanding, both of which benefit dense captioning. As shown in Fig. 1, object detection focuses on detecting salient objects such as people, boat, and pants. Relationship detection determines the holistic interpretation, which connects a pair of localized objects as a <subject-predicate-object> triplet, such as <people-wearing-pants> and <people-standing on-boat>. Since objects and relationships contain concept information at different semantic levels, they not only complement each other but also provide fine-grained semantics of the salient regions. Hence, message passing across these visual concepts can provide interactive information for dense captioning.
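To make the notion of inter-connected visual concepts concrete, the following minimal Python sketch (an illustration, not part of the paper's code) represents the objects of Fig. 1 as graph nodes and each <subject-predicate-object> triplet as a labeled directed edge:

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class SceneGraph:
        """Objects are nodes; each <subject-predicate-object> triplet is a labeled edge."""
        objects: List[str] = field(default_factory=list)
        triplets: List[Tuple[int, str, int]] = field(default_factory=list)  # (subject idx, predicate, object idx)

        def add_relationship(self, subj: int, predicate: str, obj: int) -> None:
            self.triplets.append((subj, predicate, obj))

    # The example from Fig. 1: <people-wearing-pants> and <people-standing on-boat>
    graph = SceneGraph(objects=["people", "boat", "pants"])
    graph.add_relationship(0, "wearing", 2)       # <people-wearing-pants>
    graph.add_relationship(0, "standing on", 1)   # <people-standing on-boat>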

Due to the challenging combinatorial complexity of formulating <subject-predicate-object> triplets, very little work explores message passing among them for the dense captioning task. Recently, [14] proposed to represent visual scenes as graphs containing objects and the relationships between them. Intuitively, the scene graph forms an interpretable structured representation of the image that can support higher-level semantic cues. In this paper, we propose a scene-graph-guided message passing network for the dense captioning task. As shown in Fig. 2, we first exploit message passing between objects and their relationships with a feature refining structure. Moreover, we formulate the message passing as an inter-connected visual concept generation problem in which the objective function of scene graph generation guides the region feature learning. The scene graph guidance propagates the structured knowledge of the graph through the concept-region message passing mechanism (CR-MPM), which improves the regional feature representation. Finally, the refined regional features are encoded by an LSTM-based decoder to generate dense captions.
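Since CR-MPM is only described at a high level here, the following PyTorch sketch shows one plausible form of a single concept-region message passing step; the mean pooling, feature dimensions, and GRU-based update are our illustrative assumptions, not the authors' exact design:

    import torch
    import torch.nn as nn

    class ConceptRegionMessagePassing(nn.Module):
        """One simplified concept-region message passing step: object and
        relationship features send pooled messages that update each region
        feature through a GRU cell (aggregation and dimensions are illustrative)."""

        def __init__(self, dim: int = 512):
            super().__init__()
            self.obj_to_region = nn.Linear(dim, dim)   # transform object messages
            self.rel_to_region = nn.Linear(dim, dim)   # transform relationship messages
            self.update = nn.GRUCell(dim, dim)         # fuse messages into region features

        def forward(self, region_feats, obj_feats, rel_feats):
            # region_feats: (R, dim), obj_feats: (O, dim), rel_feats: (E, dim)
            obj_msg = self.obj_to_region(obj_feats).mean(dim=0, keepdim=True)
            rel_msg = self.rel_to_region(rel_feats).mean(dim=0, keepdim=True)
            messages = (obj_msg + rel_msg).expand_as(region_feats)
            return self.update(messages, region_feats)  # refined region features, (R, dim)

    # Example usage with random features for 5 regions, 3 objects, and 4 relationships
    refiner = ConceptRegionMessagePassing(dim=512)
    refined = refiner(torch.randn(5, 512), torch.randn(3, 512), torch.randn(4, 512))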

The key contributions are as follows:

  • We propose the Scene-Graph-Guided Message Passing Network to leverage the rich visual concepts and the structured knowledge for dense caption generation;

  • We use the objective function of scene graph generation to guide the feature learning between objects and relationships, which can propagate the structured knowledge through the proposed CR-MPM and further improve the regional feature representation;

  • We evaluate the proposed model on the Visual Genome dataset. Experimental results show competitive performance against state-of-the-art methods, and qualitative experiments also confirm the effect of our model.

Section snippets

Related works

Image Captioning.

Several pioneering methods [2], [15], [16], [17] have explored describing images with natural language. They can be divided into sequence-based and attention-based methods. Sequence-based methods [1], [18], [19] first use an encoder to map images to feature vectors and then generate sentences with a decoder. Attention-based methods [3], [20], [21], [22], [23], [24] use an attention mechanism to weight each regional feature vector to exploit the spatial structure for

Framework overview

The proposed method is an end-to-end framework that can generate regional captions from an image. As shown in Fig. 2, the whole network consists of three components: the proposal generator, the concept-driven refining network, and the dense caption decoder. In particular, given an image, the proposal generator first extracts three types of regions of interest (ROIs) as region, relationship, and object proposals (Fig. 2(a)). Then, the concept-driven refining network is devised to incorporate
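A high-level skeleton of these three components might look as follows; this is a sketch only, where the sub-modules are placeholders for the proposal generator, the concept-driven refining network, and the LSTM-based caption decoder described in the text, not the authors' implementation:

    import torch.nn as nn

    class DenseCaptioningPipeline(nn.Module):
        """Sketch of the three-stage pipeline in Fig. 2; sub-modules are placeholders."""

        def __init__(self, proposal_generator, refining_network, caption_decoder):
            super().__init__()
            self.proposal_generator = proposal_generator    # (a) region / relationship / object ROIs
            self.refining_network = refining_network        # (b) concept-driven feature refining (CR-MPM)
            self.caption_decoder = caption_decoder          # (c) LSTM-based dense caption decoder

        def forward(self, image):
            region_rois, rel_rois, obj_rois = self.proposal_generator(image)
            refined_regions = self.refining_network(region_rois, obj_rois, rel_rois)
            return self.caption_decoder(refined_regions)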

Datasets

We use the Visual Genome dataset [45] as the evaluation benchmark, which provides annotations of objects, relationships, and region descriptions for each image. Since the raw Visual Genome annotations contain many noisy labels, we preprocess the relationship annotations as in [44]. All annotations are normalized across different tenses, and the 150 most frequent object categories and 50 relationship categories are automatically selected from Visual Genome. We use the same train/test splits
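As an illustration of the category filtering described above (the function name and inputs are hypothetical, and the tense normalization of [44] is assumed to happen beforehand), a minimal sketch could be:

    from collections import Counter

    def select_frequent_categories(object_labels, relationship_labels,
                                   num_objects=150, num_relationships=50):
        """Keep only the most frequent object and relationship categories,
        mirroring the Visual Genome preprocessing described above."""
        object_vocab = {c for c, _ in Counter(object_labels).most_common(num_objects)}
        relation_vocab = {c for c, _ in Counter(relationship_labels).most_common(num_relationships)}
        return object_vocab, relation_vocab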

Conclusions

In this paper, we propose the Scene-Graph-Guided Message Passing Network to leverage rich visual concepts and structured knowledge for the dense captioning task. We first exploit message passing between objects and their relationships. Then, we formulate the message passing as an inter-connected visual concept generation problem in which the objective function of scene graph generation guides the region feature learning. The scene graph guidance propagates the structured knowledge of

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Key Research and Development Program of China (2020YFB1406602), the National Natural Science Foundation of China (61772359, 62002257), the Tianjin New Generation Artificial Intelligence Major Program (19ZXZNGX00110, 18ZXZNGX00150), and the Baidu Program. We also sincerely thank the Baidu Pinecone Program for the PaddlePaddle platform.

References (52)

  • S. Ding et al., Stimulus-driven and concept-driven analysis for image caption generation, Neurocomputing, 2020.

  • Z. Gao et al., Multiple discrimination and pairwise CNN for view-based 3D object retrieval, Neural Networks, 2020.

  • J. Donahue et al., Long-term recurrent convolutional networks for visual recognition and description, IEEE Trans. Pattern Anal. Mach. Intell., 2017.

  • Q. Wu et al., What value do explicit high level concepts have in vision to language problems?, CVPR, 2016.

  • P. Anderson et al., Bottom-up and top-down attention for image captioning and visual question answering, CVPR, 2018.

  • J. Lu et al., Neural baby talk, CVPR, 2018.

  • A. Liu et al., Multi-level policy and reward reinforcement learning for image captioning, IJCAI, 2018.

  • N. Xu et al., Multi-level policy and reward-based deep reinforcement learning framework for image captioning, IEEE Trans. Multim., 2020.

  • T. Lin et al., Microsoft COCO: common objects in context, ECCV, 2014.

  • P. Young et al., From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions, TACL, 2014.

  • J. Johnson et al., DenseCap: fully convolutional localization networks for dense captioning, CVPR, 2016.

  • J. Wu et al., Generating question relevant captions to aid visual question answering.

  • Y. Xi et al., Visual question answering model based on visual relationship detection, Signal Process. Image Commun., 2020.

  • K.J. Shih et al., Where to look: focus regions for visual question answering, CVPR, 2016.

  • Z. Yu et al., Deep modular co-attention networks for visual question answering, CVPR, 2019.

  • D. Xu et al., Scene graph generation by iterative message passing, CVPR, 2017.

  • Z. Liu et al., Multiview and multimodal pervasive indoor localization, ACMMM, 2017.

  • Z. Cheng et al., MMALFM: explainable recommendation by leveraging reviews and images, ACM Trans. Inf. Syst., 2019.

  • J. Mao et al., Learning like a child: fast novel visual concept learning from sentence descriptions of images, ICCV, 2015.

  • L. Chen et al., SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning, CVPR, 2017.

  • L. Li et al., GLA: global-local attention for image description, IEEE Trans. Multimedia, 2018.

  • C. Cui et al., Distribution-oriented aesthetics assessment with semantic-aware hybrid network, IEEE Trans. Multimedia, 2019.

  • N. Xu et al., Dual-stream recurrent neural network for video captioning, IEEE Trans. Circuits Syst. Video Techn., 2019.

  • C. Yan et al., STAT: spatial-temporal attention mechanism for video captioning, IEEE Trans. Multimedia, 2019.

  • L. Yang et al., Dense captioning with joint inference and visual context, CVPR, 2017.

  • G. Yin et al., Context and attribute grounded dense captioning, CVPR, 2019.