Pattern Recognition Letters

Volume 160, August 2022, Pages 122-127

Transformer-based Cross Reference Network for video salient object detection

https://doi.org/10.1016/j.patrec.2022.06.006

Abstract

Video salient object detection is a fundamental computer vision task aimed at highlighting the most conspicuous objects in a video sequence. There are two key challenges in video salient object detection: (1) how to extract effective feature representations from appearance and motion cues, and (2) how to combine both of them into a robust saliency representation. To handle these challenges, in this paper, we propose a novel Transformer-based Cross Reference Network (TCRN), which fully exploits long-range context dependencies in both feature representation extraction and cross-modal (i.e., appearance and motion) integration. In contrast to existing CNN-based methods, our approach formulates video salient object detection as a sequence-to-sequence prediction task. In the proposed approach, deep feature extraction is achieved by a pure vision transformer with multi-resolution token representations. Specifically, we design a Gated Cross Reference (GCR) module to effectively integrate appearance and motion into the saliency representation. The GCR first propagates global context information between the two modalities, and then performs cross-modal fusion by a gate mechanism. Extensive evaluations on five widely-used benchmarks show that the proposed Transformer-based method performs favorably against the existing state-of-the-art methods.

Introduction

Video salient object detection (VSOD) aims at localizing and segmenting the most conspicuous objects in a video sequence. It serves as a fundamental processing tool in a variety of vision tasks, such as video object segmentation [1], video compression [2], video summarization [3] and medical analysis [4]. VSOD is related to human eye fixation prediction [5], which aims at finding the focus of human eyes when free-viewing scenes; however, VSOD concentrates more on highlighting the whole salient object regions with clear boundaries. VSOD presents more challenges than still-image salient object detection due to the complexity of temporal motion.

There are mainly two issues involved in VSOD, i.e., (1) how to extract effective feature representations from appearance and motion cues, and (2) how to combine both of them into a robust saliency representation. For the first issue, existing state-of-the-art methods leverage Convolutional Neural Networks (CNNs) to extract multi-scale feature representations optimized by saliency supervision. For the second issue, widely-used strategies include: two-stream spatial-temporal fusion [6,7], step-by-step propagation (e.g., recurrent neural networks (RNNs)) [8,9], and attention-based aggregation (e.g., self-attention) [10]. Although these methods have gained promising improvements, they still confront several problems: (1) they all rely on CNNs to extract deep features and are therefore limited in modeling global long-range dependencies; (2) appearance and motion features are encoded individually, at sequentially distinct stages or in parallel separate branches, so they cannot modulate each other in a fully collaborative way.

Focusing on the above-mentioned problems, we seek to leverage the Transformer [11] to model global long-range dependencies for feature extraction and integration in VSOD. The Transformer was first proposed to model long-range relations among word sequences for machine translation. Recently it has been successfully applied to image recognition (e.g., ViT [12]), showing great potential in solving vision tasks. With this inspiration, we rethink and investigate video salient object detection from a sequence-to-sequence prediction perspective.

In this paper, we propose a novel Transformer-based Cross Reference Network (TCRN) for VSOD. The proposed method fully exploits long-range context dependencies in feature representation extraction, as well as in cross-modal integration. Unlike existing convolution-based methods, feature representation extraction in TCRN is achieved by a multi-scale token-based Transformer, i.e., T2T-ViT [13]. In addition, we design a Gated Cross Reference (GCR) module to facilitate fully collaborative learning between appearance and motion features. The GCR further adopts a gate mechanism to integrate appearance and motion complementarily into a holistic saliency representation. Extensive evaluations on five widely-used benchmarks demonstrate that the proposed Transformer-based approach performs favorably against the existing state-of-the-art methods.
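The paper does not include reference code, so the following PyTorch sketch is only meant to make the GCR idea concrete: each modality first attends to the other to propagate global context, and a learned gate then mixes the two refined streams into one saliency representation. All module and parameter names below are our own illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GatedCrossReference(nn.Module):
    """Illustrative sketch of a gated cross-reference block (not the authors' code)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Cross-attention: each modality queries the other to propagate global context.
        self.app_from_mot = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mot_from_app = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_m = nn.LayerNorm(dim)
        # Gate: predicts per-channel mixing weights in [0, 1] from both refined streams.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, app: torch.Tensor, mot: torch.Tensor) -> torch.Tensor:
        # app, mot: (B, N, C) token features from the appearance and motion encoders.
        a, m = self.norm_a(app), self.norm_m(mot)
        a2m, _ = self.app_from_mot(a, m, m)   # appearance queries attend to motion tokens
        m2a, _ = self.mot_from_app(m, a, a)   # motion queries attend to appearance tokens
        app_ref = app + a2m                   # appearance refined by motion context
        mot_ref = mot + m2a                   # motion refined by appearance context
        g = self.gate(torch.cat([app_ref, mot_ref], dim=-1))
        return g * app_ref + (1.0 - g) * mot_ref  # fused saliency tokens, (B, N, C)
```

In this sketch the gate is a per-channel sigmoid over the concatenated streams; the module in the paper may form its queries, keys and gates differently.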

Related works

Video salient object detection. Compared with still-image salient object detection [14], [15], [16], video salient object detection (VSOD) presents more challenges due to the complexity of temporal feature extraction. Early works concentrate on designing effective hand-crafted saliency feature extractors and spatio-temporal fusion methods [17], [18], [19]. Recently, deep learning based methods have shown promising performance in VSOD. Wang et al. [20] employ Fully Convolutional Networks (FCNs) for video salient object detection.

Overview

An overview of the proposed architecture is shown in Fig. 1. Given a video clip containing T consecutive frames {I_t}_{t=1}^{T}, we first utilize the motion flow generator FlowNet-2.0 [26] to generate T−1 optical flow maps {M_t}_{t=1}^{T−1}, where M_t is computed between two adjacent frames I_t and I_{t+1}. The proposed architecture takes as input an RGB image I_t and its paired optical flow map M_t, producing the final saliency map. First, I_t and M_t are fed into two independent Transformer-based encoders, which adopt the T2T-ViT [13] backbone to extract multi-resolution token representations.
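To make the data flow above easier to follow, here is a minimal runnable sketch of the pipeline under the same reading: a frame and its paired flow map go through two separate encoders, the resulting tokens are fused by a gated cross-reference block, and a light decoder produces the saliency map. Plain convolutional stems stand in for the two T2T-ViT encoders and the decoder, and FlowNet-2.0 is assumed to have been run offline to produce a 3-channel flow image; all class and variable names and shapes are assumptions for illustration, not the authors' model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TCRNSketch(nn.Module):
    """Toy end-to-end pipeline following the paper's overview (not the authors' model)."""

    def __init__(self, dim: int = 64):
        super().__init__()
        # Stand-ins for the two independent Transformer-based encoders (appearance / motion).
        self.app_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.mot_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.gcr = GatedCrossReference(dim)              # from the earlier sketch
        self.decoder = nn.Conv2d(dim, 1, kernel_size=1)  # fused tokens -> saliency logits

    def forward(self, frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # frame: RGB image I_t; flow: optical flow map M_t rendered as a 3-channel image.
        fa = self.app_encoder(frame)                     # (B, C, H/16, W/16)
        fm = self.mot_encoder(flow)
        b, c, h, w = fa.shape
        fused = self.gcr(fa.flatten(2).transpose(1, 2),
                         fm.flatten(2).transpose(1, 2))  # (B, N, C)
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        logits = self.decoder(fused)                     # coarse saliency prediction
        return F.interpolate(logits, size=frame.shape[-2:],
                             mode="bilinear", align_corners=False)


# Usage: one frame/flow pair produces one full-resolution saliency map.
model = TCRNSketch()
saliency = torch.sigmoid(model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)))
```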

Experiment setup

Datasets. We evaluate our proposed method on five widely used Video Salient Object Detection (VSOD) datasets, including DAVIS [28], FBMS [29], VOS [30], SegTrackV2 [31] and ViSal [18]. DAVIS and SegTrackV2 are densely annotated datasets, which contain 50 and 14 videos, respectively. FBMS, ViSal and VOS are sparsely annotated datasets, which contain 59 videos (720 annotated frames), 19 videos (193 annotated frames) and 200 videos (7467 annotated frames), respectively. The evaluation

Conclusion

In this paper, we propose a novel Transformer-based Cross Reference Network (TCRN) for VSOD. The long-range context dependencies are fully exploited in both feature representation extraction and cross-modal integration. Specifically, we design a Gated Cross Reference (GCR) module to allow collaborative feature learning between appearance and motion features, deriving a more robust saliency representation. Experiments show that the Transformer-based architecture can perform on par with, or even better than, existing CNN-based state-of-the-art methods.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (No. 62101316).

References (45)

  • D. Mahapatra et al., Coherency based spatiotemporal saliency detection for video object segmentation, IEEE J. Selected Top. Signal Process. (2014)
  • L. Itti, Automatic foveation for video compression using a neurobiological model of visual attention, IEEE Trans. Image Process. (2004)
  • Y.F. Ma et al., A user attention model for video summarization
  • X. Wang et al., Volumetric attention for 3D medical image segmentation and detection
  • L. Itti et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. (1998)
  • G. Li et al., Flow guided recurrent neural encoder for video salient object detection
  • H. Li et al., Motion guided attention for video salient object detection
  • H. Song et al., Pyramid dilated deeper ConvLSTM for video salient object detection
  • D.P. Fan et al., Shifting more attention to video salient object detection
  • Y. Gu et al., Pyramid constrained self-attention network for fast video salient object detection
  • A. Vaswani et al., Attention is all you need
  • A. Dosovitskiy et al., An image is worth 16x16 words: Transformers for image recognition at scale, Proceedings of the International Conference on Learning Representations (2021)
  • L. Yuan et al., Tokens-to-Token ViT: training vision transformers from scratch on ImageNet
  • R. Achanta et al., Frequency-tuned salient region detection
  • Q. Hou et al., Deeply supervised salient object detection with short connections
  • J.X. Zhao et al., EGNet: edge guidance network for salient object detection
  • Y. Fang et al., Video saliency incorporating spatiotemporal cues and uncertainty weighting
  • W. Wang et al., Consistent video saliency using local gradient flow optimization and global refinement, IEEE Trans. Image Process. (2015)
  • Y. Chen et al., SCOM: spatiotemporal constrained optimization for salient object detection, IEEE Trans. Image Process. (2018)
  • W. Wang et al., Video salient object detection via fully convolutional networks, IEEE Trans. Image Process. (2018)
  • Z. Liu et al., Swin Transformer: hierarchical vision transformer using shifted windows
  • W. Wang et al., Pyramid vision transformer: a versatile backbone for dense prediction without convolutions
