Pattern Recognition Letters

Volume 160, August 2022, Pages 122-127

Transformer-based Cross Reference Network for video salient object detection

https://doi.org/10.1016/j.patrec.2022.06.006

Abstract

Video salient object detection is a fundamental computer vision task aimed at highlighting the most conspicuous objects in a video sequence. There are two key challenges in video salient object detection: (1) how to extract effective feature representations from appearance and motion cues, and (2) how to combine both of them into a robust saliency representation. To handle these challenges, in this paper, we propose a novel Transformer-based Cross Reference Network (TCRN), which fully exploits long-range context dependencies in both feature representation extraction and cross-modal (i.e., appearance and motion) integration. In contrast to existing CNN-based methods, our approach formulates video salient object detection as a sequence-to-sequence prediction task. In the proposed approach, deep feature extraction is achieved by a pure vision transformer with multi-resolution token representations. Specifically, we design a Gated Cross Reference (GCR) module to effectively integrate appearance and motion into the saliency representation. The GCR first propagates global context information between the two modalities, and then performs cross-modal fusion by a gate mechanism. Extensive evaluations on five widely-used benchmarks show that the proposed Transformer-based method performs favorably against the existing state-of-the-art methods.

Introduction

Video salient object detection (VSOD) aims at localizing and segmenting the most conspicuous objects in a video sequence. It serves as a fundamental processing tool in a variety of vision tasks, such as video object segmentation [1], video compression [2], video summarization [3] and medical analysis [4]. VSOD is related to human eye fixation prediction [5], which aims at finding the focus of human eyes when free-viewing scenes; however, VSOD concentrates more on highlighting the whole salient object regions with clear boundaries. VSOD presents more challenges than still-image salient object detection due to the complexity of temporal motion.

There are mainly two issues involved in VSOD, i.e., (1) how to extract effective feature representations from appearance and motion cues, and (2) how to combine both of them into a robust saliency representation. For the first issue, existing state-of-the-art methods leverage Convolutional Neural Networks (CNNs) to extract multi-scale feature representations optimized by saliency supervision. For the second issue, widely-used strategies include: two-stream spatial-temporal fusion [6,7], step-by-step propagation (e.g., recurrent neural networks (RNNs)) [8,9], and attention-based aggregation (e.g., self-attention) [10]. Although these methods have gained promising improvements, they still confront several problems: (1) they all rely on CNNs to extract deep features and are therefore limited in modeling global long-range dependencies; (2) appearance and motion features are encoded individually, at sequentially distinct stages or in parallel separate branches, so they cannot modulate each other in a fully collaborative way.

Focusing on the above-mentioned problems, we seek to leverage the Transformer [11] to model global long-range dependencies for feature extraction and integration in VSOD. The Transformer was first proposed to model long-range relations among word sequences for machine translation. Recently it has been successfully applied to image recognition (e.g., ViT [12]), showing great potential in solving vision tasks. With this inspiration, we rethink and investigate video salient object detection from a sequence-to-sequence prediction perspective.

In this paper, we propose a novel Transformer-based Cross Reference Network (TCRN) for VSOD. The proposed method fully exploits long-range context dependencies in feature representation extraction, as well as in cross-modal integration. Unlike existing convolution-based methods, feature representation extraction in TCRN is achieved by a multi-scale token-based Transformer, i.e., T2T-ViT [13]. In addition, we design a Gated Cross Reference (GCR) module to facilitate fully collaborative learning between appearance and motion features. The GCR further adopts a gate mechanism to integrate appearance and motion complementarily into a holistic saliency representation. Extensive evaluations on five widely-used benchmarks demonstrate that the proposed Transformer-based approach performs favorably against the existing state-of-the-art methods.
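The paper does not include reference code, so the following PyTorch sketch is only meant to make the GCR idea concrete: each modality first attends to the other to propagate global context, and a learned gate then mixes the two refined streams into one saliency representation. All module and parameter names below are our own illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GatedCrossReference(nn.Module):
    """Illustrative sketch of a gated cross-reference block (not the authors' code)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Cross-attention: each modality queries the other to propagate global context.
        self.app_from_mot = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mot_from_app = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_m = nn.LayerNorm(dim)
        # Gate: predicts per-channel mixing weights in [0, 1] from both refined streams.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, app: torch.Tensor, mot: torch.Tensor) -> torch.Tensor:
        # app, mot: (B, N, C) token features from the appearance and motion encoders.
        a, m = self.norm_a(app), self.norm_m(mot)
        a2m, _ = self.app_from_mot(a, m, m)   # appearance queries attend to motion tokens
        m2a, _ = self.mot_from_app(m, a, a)   # motion queries attend to appearance tokens
        app_ref = app + a2m                   # appearance refined by motion context
        mot_ref = mot + m2a                   # motion refined by appearance context
        g = self.gate(torch.cat([app_ref, mot_ref], dim=-1))
        return g * app_ref + (1.0 - g) * mot_ref  # fused saliency tokens, (B, N, C)
```

In this sketch the gate is a per-channel sigmoid over the concatenated streams; the module in the paper may form its queries, keys and gates differently.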

Related works

Video salient object detection. Compared with still-image salient object detection [14], [15], [16], video salient object detection (VSOD) presents more challenges due to the complexity of temporal feature extraction. Early works concentrate on designing effective hand-crafted saliency feature extractors and spatio-temporal fusion methods [17], [18], [19]. Recently, deep learning based methods have shown promising performance in VSOD. Wang et al. [20] employ Fully Convolutional Networks (FCNs) for video salient object detection.

Overview

An overview of the proposed architecture is shown in Fig. 1. Given a video clip containing T consecutive frames {I_t}_{t=1}^{T}, we first utilize the motion flow generator FlowNet-2.0 [26] to generate T−1 optical flow maps {M_t}_{t=1}^{T−1}, where M_t is computed between two adjacent frames I_t and I_{t+1}. The proposed architecture takes as input an RGB image I_t and its paired optical flow map M_t, producing the final saliency map. First, I_t and M_t are fed into two independent Transformer-based encoders, which adopt the T2T-ViT [13] backbone to extract multi-resolution token representations.
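To make the data flow above easier to follow, here is a minimal runnable sketch of the pipeline under the same reading: a frame and its paired flow map go through two separate encoders, the resulting tokens are fused by a gated cross-reference block, and a light decoder produces the saliency map. Plain convolutional stems stand in for the two T2T-ViT encoders and the decoder, and FlowNet-2.0 is assumed to have been run offline to produce a 3-channel flow image; all class and variable names and shapes are assumptions for illustration, not the authors' model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TCRNSketch(nn.Module):
    """Toy end-to-end pipeline following the paper's overview (not the authors' model)."""

    def __init__(self, dim: int = 64):
        super().__init__()
        # Stand-ins for the two independent Transformer-based encoders (appearance / motion).
        self.app_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.mot_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.gcr = GatedCrossReference(dim)              # from the earlier sketch
        self.decoder = nn.Conv2d(dim, 1, kernel_size=1)  # fused tokens -> saliency logits

    def forward(self, frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # frame: RGB image I_t; flow: optical flow map M_t rendered as a 3-channel image.
        fa = self.app_encoder(frame)                     # (B, C, H/16, W/16)
        fm = self.mot_encoder(flow)
        b, c, h, w = fa.shape
        fused = self.gcr(fa.flatten(2).transpose(1, 2),
                         fm.flatten(2).transpose(1, 2))  # (B, N, C)
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        logits = self.decoder(fused)                     # coarse saliency prediction
        return F.interpolate(logits, size=frame.shape[-2:],
                             mode="bilinear", align_corners=False)


# Usage: one frame/flow pair produces one full-resolution saliency map.
model = TCRNSketch()
saliency = torch.sigmoid(model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)))
```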

Experiment setup

Datasets. We evaluate our proposed method on five widely used Video Salient Object Detection (VSOD) datasets, including DAVIS [28], FBMS [29], VOS [30], SegTrackV2 [31] and ViSal [18]. DAVIS and SegTrackV2 are densely annotated datasets, which contain 50 and 14 videos, respectively. FBMS, ViSal and VOS are sparsely annotated datasets, which contain 59 videos (720 annotated frames), 19 videos (193 annotated frames) and 200 videos (7467 annotated frames), respectively. The evaluation

Conclusion

In this paper, we propose a novel Transformer-based Cross Reference Network (TCRN) for VSOD. The long-range context dependencies are fully exploited in both feature representation extraction and cross-modal integration. Specifically, we design a Gated Cross Reference (GCR) module to allow collaborative feature learning between appearance and motion features, deriving a more robust saliency representation. Experiments show that the Transformer-based architecture can perform on par with, or even better than, existing CNN-based state-of-the-art methods.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (No. 62101316).

References (45)

  • D. Mahapatra et al., Coherency based spatiotemporal saliency detection for video object segmentation, IEEE J. Selected Top. Signal Process. (2014)
  • L. Itti, Automatic foveation for video compression using a neurobiological model of visual attention, IEEE Trans. Image Process. (2004)
  • Y.F. Ma et al., A user attention model for video summarization
  • X. Wang et al., Volumetric attention for 3D medical image segmentation and detection
  • L. Itti et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. (1998)
  • G. Li et al., Flow guided recurrent neural encoder for video salient object detection
  • H. Li et al., Motion guided attention for video salient object detection
  • H. Song et al., Pyramid dilated deeper ConvLSTM for video salient object detection
  • D.P. Fan et al., Shifting more attention to video salient object detection
  • Y. Gu et al., Pyramid constrained self-attention network for fast video salient object detection
  • A. Vaswani et al., Attention is all you need
  • A. Dosovitskiy et al., An image is worth 16x16 words: Transformers for image recognition at scale, Proceedings of the International Conference on Learning Representations (2021)
  • L. Yuan et al., Tokens-to-Token ViT: training vision transformers from scratch on ImageNet
  • R. Achanta et al., Frequency-tuned salient region detection
  • Q. Hou et al., Deeply supervised salient object detection with short connections
  • J.X. Zhao et al., EGNet: edge guidance network for salient object detection
  • Y. Fang et al., Video saliency incorporating spatiotemporal cues and uncertainty weighting
  • W. Wang et al., Consistent video saliency using local gradient flow optimization and global refinement, IEEE Trans. Image Process. (2015)
  • Y. Chen et al., SCOM: spatiotemporal constrained optimization for salient object detection, IEEE Trans. Image Process. (2018)
  • W. Wang et al., Video salient object detection via fully convolutional networks, IEEE Trans. Image Process. (2018)
  • Z. Liu et al., Swin Transformer: hierarchical vision transformer using shifted windows
  • W. Wang et al., Pyramid vision transformer: a versatile backbone for dense prediction without convolutions
