Scene-Graph-Guided Message Passing Network for Dense Captioning
Introduction
Recently, image captioning [1] has received much attention in the field of computer vision. A number of existing approaches [2], [3], [4], [5], [6] have achieved remarkable success on popular datasets such as MSCOCO [7] and Flickr30k [8]. Further, due to the challenge of achieving complete visual understanding from holistic images, [9] proposes the task of dense captioning, where densely describing image regions yields a finer interpretation of the visual content. In particular, the generated captions provide fine-grained semantic cues for image regions, which further enables complex reasoning over the visual context. Hence, dense captioning can support visual question answering [10], [11], since answering open-ended questions requires detailed visual information about question-relevant regions [12], [13].
Intuitively, image data comprises diverse visual concepts at different semantic levels, such as objects and the relationships between them. These visual concepts are highly correlated with the task of dense captioning: they provide rich semantic cues and support complex scene understanding. As shown in Fig. 1, object detection focuses on detecting salient objects such as people, boat, and pants. Relationship detection determines the holistic interpretation, connecting a pair of localized objects as a subject-predicate-object triplet, such as people-wearing-pants and people-standing on-boat. Since objects and relationships contain concept information at different semantic levels, they can not only complement each other but also provide fine-grained semantics for the salient regions. Hence, passing messages across these highly correlated visual concepts can provide interactive information for dense captioning.
Due to the challenging combinatorial complexity of formulating subject-predicate-object triplets, very little work explores message passing among them for the dense captioning task. Recently, [14] proposed to represent visual scenes as graphs containing objects and the relationships between them. Intuitively, the scene graph forms an interpretable structured representation of the image that can supply higher-level semantic cues. In this paper, we propose a scene-graph-guided message passing network for the dense captioning task. As shown in Fig. 2, we first exploit message passing between objects and their relationships with a feature refining structure. Moreover, we formulate the message passing as an inter-connected visual concept generation problem, where the objective function of scene graph generation is used to guide the region feature learning. The scene graph guidance propagates the structured knowledge of the graph through the concept-region message passing mechanism (CR-MPM), which improves the regional feature representation. Finally, the refined regional features are encoded by an LSTM-based decoder to generate dense captions.
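The idea of exchanging messages between object and relationship nodes can be sketched in a minimal form. The following is an illustrative mean-aggregation update, not the paper's exact CR-MPM: the function name, the blending weight `alpha`, and the update rule are all assumptions made for exposition.

```python
import numpy as np

def cr_mpm_step(obj_feats, rel_feats, triplets, alpha=0.5):
    """One hypothetical message-passing step between object and
    relationship features.

    obj_feats:  (N, D) array of object-node features
    rel_feats:  (M, D) array of relationship-node features
    triplets:   list of (subject_idx, rel_idx, object_idx) edges
    alpha:      blend between a node's own state and the mean of its
                incoming messages (a design assumption)
    """
    obj_msg = np.zeros_like(obj_feats)
    obj_deg = np.zeros(len(obj_feats))
    rel_msg = np.zeros_like(rel_feats)
    rel_deg = np.zeros(len(rel_feats))
    for s, r, o in triplets:
        # relationship -> subject/object messages
        obj_msg[s] += rel_feats[r]
        obj_msg[o] += rel_feats[r]
        obj_deg[s] += 1
        obj_deg[o] += 1
        # subject/object -> relationship messages
        rel_msg[r] += obj_feats[s] + obj_feats[o]
        rel_deg[r] += 2
    obj_out = obj_feats.copy()
    m = obj_deg > 0
    obj_out[m] = (1 - alpha) * obj_feats[m] + alpha * obj_msg[m] / obj_deg[m][:, None]
    rel_out = rel_feats.copy()
    m = rel_deg > 0
    rel_out[m] = (1 - alpha) * rel_feats[m] + alpha * rel_msg[m] / rel_deg[m][:, None]
    return obj_out, rel_out
```

In the full model, such an update would operate on learned RoI features and be trained jointly with the scene graph generation objective; here the aggregation is left parameter-free to keep the sketch self-contained.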
The key contributions are as follows:

- We propose the Scene-Graph-Guided Message Passing Network to leverage rich visual concepts and structured knowledge for dense caption generation;
- We use the objective function of scene graph generation to guide the feature learning between objects and relationships, which propagates the structured knowledge through the proposed CR-MPM and further improves the regional feature representation;
- We evaluate the proposed model on the Visual Genome dataset. Experimental results show competitive performance against state-of-the-art methods, and qualitative experiments confirm the effectiveness of our model.
Related works
Image Captioning.
Several pioneering methods [2], [15], [16], [17] have explored describing images with natural language. They can be divided into sequence-based and attention-based methods. Sequence-based methods [1], [18], [19] first use an encoder to map images to feature vectors and then generate translated sentences with a decoder. Attention-based methods [3], [20], [21], [22], [23], [24] use an attention mechanism to weight each regional feature vector, exploiting the spatial structure for
Framework overview
The proposed method is an end-to-end framework that generates regional captions from an image. As shown in Fig. 2, the whole network consists of three components: the proposal generator, the concept-driven refining network, and the dense caption decoder. In particular, given an image, the proposal generator first extracts three types of regions of interest (RoIs): region, relationship, and object proposals (Fig. 2(a)). Then, the concept-driven refining network is devised to incorporate
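The three-stage flow above can be expressed as simple glue code. This is a hypothetical sketch: each callable stands in for a trained sub-network of the real model, and all names are illustrative.

```python
def dense_caption_pipeline(image, proposal_generator, refining_network,
                           caption_decoder):
    """Hypothetical composition of the three components in Fig. 2."""
    # (a) extract region, relationship, and object proposals (RoIs)
    region_rois, rel_rois, obj_rois = proposal_generator(image)
    # (b) refine region features with concept-level messages
    refined = refining_network(region_rois, rel_rois, obj_rois)
    # (c) decode one caption per refined region feature
    return [caption_decoder(feat) for feat in refined]
```

A toy invocation with stub callables shows the data flow:

```python
caps = dense_caption_pipeline(
    "image",
    lambda im: ([1, 2], ["rel"], ["obj"]),        # stub proposal generator
    lambda regions, rels, objs: [x * 10 for x in regions],  # stub refiner
    lambda f: f"caption-{f}",                      # stub decoder
)
# caps == ["caption-10", "caption-20"]
```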
Datasets
We use the Visual Genome dataset [45] as the evaluation benchmark, which provides annotations of objects, relationships, and region descriptions for each image. Since the raw Visual Genome annotations contain many noisy labels, we preprocess the relationship annotations as in [44]. All annotations are normalized across tenses, and the 150 most frequent object categories and 50 relationship categories are automatically selected from Visual Genome. We use the same train/test splits
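The category-selection step described above can be sketched as a frequency cutoff over the (already tense-normalized) annotation strings. The function name and signature are illustrative, not taken from any released code.

```python
from collections import Counter

def select_frequent_categories(object_labels, relation_labels,
                               num_objects=150, num_relations=50):
    """Keep only the most frequent object/relationship categories.

    object_labels / relation_labels are flat lists of annotation
    strings, one entry per annotation occurrence in the dataset.
    """
    obj_vocab = {c for c, _ in Counter(object_labels).most_common(num_objects)}
    rel_vocab = {c for c, _ in Counter(relation_labels).most_common(num_relations)}
    return obj_vocab, rel_vocab
```

Annotations whose category falls outside the selected vocabularies would then be dropped before training.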
Conclusions
In this paper, we propose the Scene-Graph-Guided Message Passing Network to leverage rich visual concepts and structured knowledge for the dense captioning task. We first exploit message passing between objects and their relationships. Then, we formulate the message passing as an inter-connected visual concept generation problem, where the objective function of scene graph generation is used to guide the region feature learning. The scene graph guidance propagates the structured knowledge of
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported in part by the National Key Research and Development Program of China (2020YFB1406602), the National Natural Science Foundation of China (61772359, 62002257), the Tianjin New Generation Artificial Intelligence Major Program (19ZXZNGX00110, 18ZXZNGX00150), and the Baidu Program. We also sincerely thank the Baidu Pinecone Program for the PaddlePaddle platform.
References (52)
- et al., Stimulus-driven and concept-driven analysis for image caption generation, Neurocomputing (2020)
- et al., Multiple discrimination and pairwise CNN for view-based 3D object retrieval, Neural Networks (2020)
- et al., Long-term recurrent convolutional networks for visual recognition and description, IEEE Trans. Pattern Anal. Mach. Intell. (2017)
- et al., What value do explicit high level concepts have in vision to language problems?, CVPR (2016)
- et al., Bottom-up and top-down attention for image captioning and visual question answering, CVPR (2018)
- et al., Neural baby talk, CVPR (2018)
- et al., Multi-level policy and reward reinforcement learning for image captioning, IJCAI (2018)
- et al., Multi-level policy and reward-based deep reinforcement learning framework for image captioning, IEEE Trans. Multim. (2020)
- et al., Microsoft COCO: common objects in context, ECCV (2014)
- et al., From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions, TACL (2014)
- DenseCap: fully convolutional localization networks for dense captioning, CVPR
- Generating question relevant captions to aid visual question answering
- Visual question answering model based on visual relationship detection, Signal Process. Image Commun.
- Where to look: focus regions for visual question answering, CVPR
- Deep modular co-attention networks for visual question answering, CVPR
- Scene graph generation by iterative message passing, CVPR
- Multiview and multimodal pervasive indoor localization, ACM MM
- MMALFM: explainable recommendation by leveraging reviews and images, ACM Trans. Inf. Syst.
- Learning like a child: fast novel visual concept learning from sentence descriptions of images, ICCV
- SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning, CVPR
- GLA: global-local attention for image description, IEEE Trans. Multimedia
- Distribution-oriented aesthetics assessment with semantic-aware hybrid network, IEEE Trans. Multimedia
- Dual-stream recurrent neural network for video captioning, IEEE Trans. Circuits Syst. Video Technol.
- STAT: spatial-temporal attention mechanism for video captioning, IEEE Trans. Multimedia
- Dense captioning with joint inference and visual context, CVPR
- Context and attribute grounded dense captioning, CVPR