Transformer-based Cross Reference Network for video salient object detection
Introduction
Video salient object detection (VSOD) aims at localizing and segmenting the most conspicuous objects in a video sequence. It serves as a fundamental processing tool in a variety of vision tasks, such as video object segmentation [1], video compression [2], video summarization [3] and medical analysis [4]. VSOD is related to human eye fixation prediction [5], which aims to find where human eyes focus when free-viewing a scene; VSOD, however, concentrates on highlighting whole salient object regions with clear boundaries. Compared with still-image salient object detection, VSOD presents more challenges due to the complexity of temporal motion.
There are mainly two issues involved in VSOD: (1) how to extract effective feature representations from appearance and motion cues, and (2) how to combine the two into a robust saliency representation. For the first issue, existing state-of-the-art methods leverage Convolutional Neural Networks (CNNs) to extract multi-scale feature representations optimized by saliency supervision. For the second issue, widely used strategies include two-stream spatio-temporal fusion [6,7], step-by-step propagation (e.g., recurrent neural networks (RNNs)) [8,9], and attention-based aggregation (e.g., self-attention) [10]. Although these methods have achieved promising improvements, they still face two problems: (1) they all rely on CNNs to extract deep features and are therefore limited in modeling global long-range dependencies; (2) appearance and motion features are encoded individually, either at sequentially distinct stages or in parallel separate branches, so they cannot modulate each other in a fully collaborative way.
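To make the second limitation concrete, the following is a minimal PyTorch sketch of the conventional two-stream fusion baseline: appearance and motion are encoded by separate branches and combined only by late fusion, so neither branch can modulate the other during encoding. All module names and channel sizes here are illustrative, not taken from any cited method.

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Generic two-stream late-fusion baseline (illustrative only)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        # Two independent encoders: one for RGB frames, one for optical flow.
        self.appearance_branch = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.motion_branch = nn.Sequential(
            nn.Conv2d(2, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        # Late fusion: concatenation + 1x1 conv, then a saliency readout.
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.readout = nn.Conv2d(channels, 1, 1)

    def forward(self, frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        fa = self.appearance_branch(frame)  # appearance features, never sees flow
        fm = self.motion_branch(flow)       # motion features, never sees RGB
        fused = torch.relu(self.fuse(torch.cat([fa, fm], dim=1)))
        return torch.sigmoid(self.readout(fused))  # per-pixel saliency map
```

Because the two branches interact only at the final concatenation, any cross-modal cue (e.g., motion disambiguating a cluttered appearance) arrives too late to influence feature encoding, which is precisely the gap the cross-reference design below targets.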
Focusing on the above problems, we seek to leverage the Transformer [11] to model global long-range dependencies for feature extraction and integration in VSOD. The Transformer was first proposed to model long-range relations among word sequences for machine translation. Recently, it has been successfully applied to image recognition (e.g., ViT [12]), showing great potential for vision tasks. With this inspiration, we rethink and investigate video salient object detection from a sequence-to-sequence prediction perspective.
In this paper, we propose a novel Transformer-based Cross Reference Network (TCRN) for VSOD. The proposed method fully exploits long-range context dependencies in both feature extraction and cross-modal integration. Unlike existing convolution-based methods, feature extraction in TCRN is performed by a multi-scale token-based Transformer, i.e., T2T-ViT [13]. In addition, we design a Gated Cross Reference (GCR) module to enable fully collaborative learning between appearance and motion features. The GCR further adopts a gate mechanism to complementarily integrate appearance and motion into a holistic saliency representation. Extensive evaluations on five widely used benchmarks demonstrate that the proposed Transformer-based approach performs favorably against existing state-of-the-art methods.
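Since the section snippet does not spell out the GCR formulation, the following is a hypothetical sketch of what gated cross reference between appearance and motion tokens could look like: each modality attends to the other, and a learned sigmoid gate controls how much of the referenced features is admitted before the two streams are merged. The attention/gating layout and all names are assumptions, not the paper's definition.

```python
import torch
import torch.nn as nn

class GatedCrossReference(nn.Module):
    """Hypothetical gated cross-reference block (not the paper's exact GCR)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Cross-attention: queries from one modality, keys/values from the other.
        self.app_from_motion = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.motion_from_app = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Gates computed from each token paired with its referenced counterpart.
        self.gate_app = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.gate_motion = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, app: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # app, motion: (batch, tokens, dim) sequences from the two encoders.
        app_ref, _ = self.app_from_motion(app, motion, motion)
        motion_ref, _ = self.motion_from_app(motion, app, app)
        # Gate each referenced stream before mixing it back in (residual update).
        ga = self.gate_app(torch.cat([app, app_ref], dim=-1))
        gm = self.gate_motion(torch.cat([motion, motion_ref], dim=-1))
        app = app + ga * app_ref
        motion = motion + gm * motion_ref
        # Holistic saliency representation from the two refined streams.
        return self.merge(torch.cat([app, motion], dim=-1))
```

The gate lets the network suppress an unreliable modality (e.g., noisy flow during camera shake) instead of admitting it unconditionally, which is one plausible reading of "complementary" integration.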
Section snippets
Related works
Video salient object detection. Compared with still-image salient object detection [14], [15], [16], video salient object detection (VSOD) presents more challenges due to the complexity of temporal feature extraction. Early works concentrate on designing effective hand-crafted saliency feature extractors and spatio-temporal fusion methods [17], [18], [19]. Recently, deep learning based methods have shown promising performance in VSOD. Wang et al. [20] employ Fully Convolutional Networks (FCNs)…
Overview
An overview of the proposed architecture is shown in Fig. 1. Given a video clip containing consecutive frames {I_1, …, I_T}, we first utilize the motion flow generator FlowNet-2.0 [26] to produce optical flow maps {O_t}, where O_t is computed between two adjacent frames I_t and I_{t+1}. The proposed architecture takes as input an RGB image I_t and its paired optical flow map O_t, producing the final saliency map. First, I_t and O_t are fed into two independent Transformer-based encoders, which adopt a…
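Read as a pipeline, the overview above suggests roughly the following forward pass. This is a high-level sketch under stated assumptions: flow maps are precomputed by FlowNet-2.0, `encoder_rgb`/`encoder_flow` stand in for the T2T-ViT encoders, `gcr` for the GCR module, and the token-to-map decoder is a placeholder, since the snippet truncates before describing decoding.

```python
import torch
import torch.nn as nn

class TCRNSketch(nn.Module):
    """Assumption-laden sketch of the described pipeline, not the actual TCRN."""

    def __init__(self, encoder_rgb: nn.Module, encoder_flow: nn.Module,
                 gcr: nn.Module, dim: int = 256, grid: int = 14):
        super().__init__()
        self.encoder_rgb = encoder_rgb    # stand-in for the RGB T2T-ViT encoder
        self.encoder_flow = encoder_flow  # stand-in for the flow T2T-ViT encoder
        self.gcr = gcr                    # stand-in for the GCR module
        self.grid = grid                  # token grid side; tokens == grid * grid
        # Placeholder decoder: project tokens to a coarse map, then upsample.
        self.decoder = nn.Sequential(
            nn.Conv2d(dim, 1, 1),
            nn.Upsample(scale_factor=16, mode="bilinear"))

    def forward(self, frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        app = self.encoder_rgb(frame)     # (B, N, dim) appearance tokens
        motion = self.encoder_flow(flow)  # (B, N, dim) motion tokens
        fused = self.gcr(app, motion)     # (B, N, dim) holistic representation
        b, n, d = fused.shape
        fmap = fused.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        return torch.sigmoid(self.decoder(fmap))  # final saliency map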
Experiment setup
Datasets. We evaluate the proposed method on five widely used Video Salient Object Detection (VSOD) datasets: DAVIS [28], FBMS [29], VOS [30], SegTrackV2 [31] and ViSal [18]. DAVIS and SegTrackV2 are densely annotated datasets containing 50 and 14 videos, respectively. FBMS, ViSal and VOS are sparsely annotated datasets containing 59 videos (720 annotated frames), 19 videos (193 annotated frames) and 200 videos (7467 annotated frames), respectively. The evaluation…
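The snippet cuts off before listing the evaluation protocol. For orientation, the sketch below computes two metrics conventionally reported on these benchmarks, mean absolute error (MAE) and the F-measure with an adaptive threshold; this reflects common VSOD practice rather than a claim about which metrics the paper uses.

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a [0,1] saliency map and binary ground truth."""
    return float(np.mean(np.abs(pred - gt)))

def f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    """F-beta score with the conventional adaptive threshold 2 * mean(pred)."""
    thresh = min(2.0 * float(pred.mean()), 1.0)
    binary = pred >= thresh
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), with beta^2 = 0.3 as usual
    return float((1 + beta2) * precision * recall /
                 (beta2 * precision + recall + 1e-8))
```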
Conclusion
In this paper, we propose a novel Transformer-based Cross Reference Network (TCRN) for VSOD. Long-range context dependencies are fully exploited in both feature extraction and cross-modal integration. Specifically, we design a Gated Cross Reference (GCR) module to allow collaborative feature learning between appearance and motion features, deriving a more robust saliency representation. Experiments show that a Transformer-based architecture can perform on par with, or even better…
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was supported by the National Natural Science Foundation of China (No. 62101316).
References (45)
- et al., Coherency based spatiotemporal saliency detection for video object segmentation, IEEE J. Selected Top. Signal Process. (2014)
- Automatic foveation for video compression using a neurobiological model of visual attention, IEEE Trans. Image Process. (2004)
- et al., A user attention model for video summarization
- et al., Volumetric attention for 3D medical image segmentation and detection
- et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. (1998)
- et al., Flow guided recurrent neural encoder for video salient object detection
- et al., Motion guided attention for video salient object detection
- et al., Pyramid dilated deeper ConvLSTM for video salient object detection
- et al., Shifting more attention to video salient object detection
- et al., Pyramid constrained self-attention network for fast video salient object detection
- Attention is all you need
- An image is worth 16x16 words: Transformers for image recognition at scale, Proceedings of the International Conference on Learning Representations
- Tokens-to-token ViT: training vision transformers from scratch on ImageNet
- Frequency-tuned salient region detection
- Deeply supervised salient object detection with short connections
- EGNet: edge guidance network for salient object detection
- Video saliency incorporating spatiotemporal cues and uncertainty weighting
- Consistent video saliency using local gradient flow optimization and global refinement, IEEE Trans. Image Process.
- SCOM: spatiotemporal constrained optimization for salient object detection, IEEE Trans. Image Process.
- Video salient object detection via fully convolutional networks, IEEE Trans. Image Process.
- Swin Transformer: hierarchical vision transformer using shifted windows
- Pyramid vision transformer: a versatile backbone for dense prediction without convolutions