Neurocomputing, Volume 417, 5 December 2020, Pages 202-211

Triple attention network for video segmentation

https://doi.org/10.1016/j.neucom.2020.07.078

Abstract

Video segmentation automatically segments a target object throughout a video and has recently achieved good progress due to the development of deep convolutional neural networks (DCNNs). However, how to simultaneously capture long-range dependencies in multiple spaces remains an important issue in video segmentation. In this paper, we propose a novel triple attention network (TriANet) that simultaneously exploits temporal, spatial, and channel context knowledge by using the self-attention mechanism to enhance the discriminant ability of feature representations. We verify our method on the Shining3D dental, DAVIS16, and DAVIS17 datasets, and the results show our method to be competitive when compared with other state-of-the-art video segmentation methods.

Introduction

Video segmentation is a challenging and fundamental problem that aims to separate the foreground pixels from the background pixels in all frames of a given video. It has been an active area of research in computer vision in recent years, and potential applications include video editing [1], medical diagnosis [2], and autonomous driving [3].

Recently, due to developments in deep learning, image segmentation based on multiscale analysis [4] and synthesis-based data augmentation [5] has been able to provide acceptable outputs. Contexts in the spatial, temporal, and channel domains are important factors in enhancing the effectiveness of existing approaches. Examples of relationships in these domains, drawn from the DAVIS16 dataset [6], are illustrated in Fig. 1. The top and middle rows show that there are many highly related regions (represented by dotted boxes of the same colors) across a temporal sequence or within a single image; these temporal and spatial contexts enhance the robustness of the inference. The bottom row illustrates feature maps in different channels. We find that the high-value regions (in red dotted boxes) of different channels correspond to different parts of the object, for instance, the foot and head of a person, and that the pairwise relations between these parts provide additional semantic cues to refine the segmentation result. However, how to simultaneously capture long-range dependencies in the spatial, temporal, and channel spaces remains an open issue in video segmentation.

To model relations in a specific domain, nonlocal neural networks [7] learn long-range dependencies in the spatial domain from the affinity between pixels. What is needed is an approach that flexibly extends this mechanism to other spaces, together with a method that properly combines the context features from multiple spaces, to enhance the discrimination capacity in pixelwise classification tasks such as video segmentation.
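For reference, below is a minimal PyTorch sketch of such a non-local block: pairwise affinities between all spatial positions are used to aggregate context for each pixel. The 1×1 convolutions, bottleneck width, and residual connection follow the common embedded-Gaussian formulation of [7]; they are illustrative assumptions rather than this paper's exact configuration.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Minimal sketch of a non-local (spatial self-attention) block [7]."""

    def __init__(self, in_channels, inter_channels=None):
        super().__init__()
        inter_channels = inter_channels or in_channels // 2
        self.theta = nn.Conv2d(in_channels, inter_channels, kernel_size=1)  # query
        self.phi = nn.Conv2d(in_channels, inter_channels, kernel_size=1)    # key
        self.g = nn.Conv2d(in_channels, inter_channels, kernel_size=1)      # value
        self.out = nn.Conv2d(inter_channels, in_channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, c')
        k = self.phi(x).flatten(2)                     # (b, c', hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, c')
        affinity = torch.softmax(q @ k, dim=-1)        # (b, hw, hw) pixel affinities
        y = (affinity @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual context update
```

Because the affinity matrix is (HW) × (HW), such a block is usually inserted on a downsampled feature map to keep memory manageable.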

In this paper, we present a novel framework, called the triple attention network (TriANet), which is illustrated in Fig. 2. The temporal attention map is learned from the representations of past frames and the current frame, and it captures the temporal dependencies between memory information and current observations. The channel attention map and the spatial attention map are computed from the current image alone, since the spatial and channel dependencies are dynamic and independent of historical information. The feature maps in each domain are then updated into context features that carry enough semantic information for video segmentation.
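To make the temporal branch concrete, here is a hypothetical sketch under the assumptions above: queries are taken from the current frame and keys/values from a stack of memory (past-frame) features, so each pixel of the current frame aggregates context from related pixels in the past. The function name, tensor shapes, and softmax scaling are our own illustrative choices, not the paper's exact operator.

```python
import torch

def temporal_attention(curr_feat, mem_feats):
    """Hypothetical temporal-attention sketch.
    curr_feat: (b, c, h, w) current-frame features.
    mem_feats: (b, c, t, h, w) features from t past frames."""
    b, c, h, w = curr_feat.shape
    q = curr_feat.flatten(2).transpose(1, 2)        # (b, hw, c) queries from current frame
    k = mem_feats.flatten(2)                        # (b, c, t*hw) keys from memory
    v = mem_feats.flatten(2).transpose(1, 2)        # (b, t*hw, c) values from memory
    attn = torch.softmax(q @ k / c ** 0.5, dim=-1)  # (b, hw, t*hw) temporal affinities
    ctx = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
    return curr_feat + ctx                          # residual context update
```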

The contributions of this paper are as follows:

  • We present a new triple attention network with a self-attention mechanism to enhance the discriminant ability of feature maps for video segmentation.

  • We simultaneously exploit temporal, spatial, and channel context knowledge by using relatively lightweight networks to improve the segmentation results.

  • Experimental results on the Shining3D dental, DAVIS16, and DAVIS17 datasets show that our method yields satisfactory results when compared with state-of-the-art video segmentation methods.


Related work

In this section, we briefly review the literature on context exploitation in video segmentation.

Spatial context exploitation. Graph models, such as the spatiotemporal Markov random field (STMRF) [8] and VideoGCRF [9], encode spatial dependencies in a deep learning framework. However, these approaches, like sample relation exploitation [10], [11], are time consuming in the inference stage and sensitive to visual appearance changes. Therefore, adaptive affinity fields (AAF) [12] were …

Our approach

Because convolution operates on a local receptive field, features in different regions corresponding to the same class may exhibit discrepancies. This discrepancy introduces intraclass inconsistency and degrades segmentation accuracy. To address this problem, we explore global contextual knowledge by building relations among features with the attention mechanism, which learns long-range contextual knowledge in the channel, spatial, and temporal dimensions. Our method is shown in Fig. 2. We use a …
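As an illustration of the channel branch, the sketch below applies self-attention across channels: the affinity between every pair of channel maps reweights the features, so channels responding to related object parts (e.g., the foot and head of a person) reinforce each other. This follows the style of channel attention in dual attention networks (Fu et al., CVPR 2019) and is an assumption about the operator, not a reproduction of it.

```python
import torch

def channel_attention(x):
    """Sketch of channel self-attention. x: (b, c, h, w) feature maps."""
    b, c, h, w = x.shape
    f = x.flatten(2)                                          # (b, c, hw)
    affinity = torch.softmax(f @ f.transpose(1, 2), dim=-1)   # (b, c, c) channel relations
    y = (affinity @ f).reshape(b, c, h, w)                    # aggregate related channels
    return x + y                                              # residual context update
```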

Experimental results

In this section, we compare the effectiveness of our proposed method with that of other methods.

Conclusion

The experimental results demonstrate that attention is an effective mechanism for video segmentation and can be employed simultaneously in the spatial, temporal, and channel domains. Specifically, we propose a method that uses self-attention to infer and combine context from different aspects of a video sequence and to acquire representative and informative context features. Although our method may produce inexact segmentations due in part to factors such as shadows, our method is robust to …

CRediT authorship contribution statement

Yan Tian: Methodology, Writing - original draft, Writing - review & editing, Funding acquisition. Yujie Zhang: Software, Data curation, Visualization. Di Zhou: Investigation, Resources, Writing - review & editing, Funding acquisition. Guohua Cheng: Resources, Writing - review & editing, Funding acquisition. Wei-Gang Chen: Validation, Formal analysis, Project administration. Ruili Wang: Conceptualization, Writing - review & editing, Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grant 61972351, Grant 61672460, and Grant 61702453, in part by the Natural Science Foundation of Zhejiang Province under Grant LY19F030005, Grant LY18F020008, Grant LQ17F030001, and Grant LQ20F020008, in part by the Science and Technology Program of Zhejiang Province under Grant 2020C01049, and in part by the Opening Foundation of the State Key Laboratory of Virtual Reality Technology and Systems of Beihang University …


References (51)

  • Y. Tian et al., Joint temporal context exploitation and active learning for video segmentation, PR (2019)
  • W. Wang et al., Inferring salient objects from human fixations, TPAMI (2019)
  • W. Wang et al., A deep network solution for attention and aesthetics aware photo cropping, TPAMI (2018)
  • Y. Tian et al., Traffic sign detection using a multi-scale recurrent attention network, TITS (2019)
  • L. Chen et al., Encoder-decoder with atrous separable convolution for semantic image segmentation, ECCV (2018)
  • Y. Zhu et al., Improving semantic segmentation via video propagation and label relaxation, CVPR (2019)
  • F. Perazzi et al., A benchmark dataset and evaluation methodology for video object segmentation, CVPR (2016)
  • X. Wang et al., Non-local neural networks, CVPR (2018)
  • L. Bao et al., CNN in MRF: Video object segmentation via inference in a CNN-based higher-order spatio-temporal MRF, CVPR (2018)
  • S. Chandra et al., Deep spatio-temporal random fields for efficient video segmentation, CVPR (2018)
  • X. Dong et al., Quadruplet network with one-shot learning for fast visual object tracking, TIP (2019)
  • X. Dong et al., Triplet loss in Siamese network for object tracking, ECCV (2018)
  • T. Ke et al., Adaptive affinity fields for semantic segmentation, ECCV (2018)
  • S. Liu et al., Learning affinity via spatial propagation networks, NeurIPS (2017)
  • Y. Zhuang et al., RelationNet: Learning deep-aligned representation for semantic image segmentation, ICPR (2018)
  • P. Jiang et al., DifNet: Semantic segmentation by diffusion networks, NeurIPS (2018)
  • T. Le et al., Semantic instance meets salient object: Study on video semantic salient instance segmentation, WACV (2019)
  • J. Ahn et al., Weakly supervised learning of instance segmentation with inter-pixel relations, CVPR (2019)
  • Z. Liang et al., Local semantic Siamese networks for fast tracking, TIP (2019)
  • Z. Wang et al., Learning channel-wise interactions for binary convolutional neural networks, CVPR (2019)
  • F. Wang et al., Residual attention network for image classification, CVPR (2017)
  • P. Zhang et al., EleAtt-RNN: Adding attentiveness to neurons in recurrent neural networks, TPAMI (2019)
  • L. Chen et al., SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning, CVPR (2017)
  • J. Fu et al., Dual attention network for scene segmentation, CVPR (2019)
  • H. Ci et al., Video object segmentation by learning location-sensitive embeddings, ECCV (2018)

Yan Tian received his PhD degree from Beijing University of Posts and Telecommunications, Beijing, China, in 2011. He held a postdoctoral research fellow position at Zhejiang University, Hangzhou, China, from 2012 to 2015. He is currently an Associate Professor in the School of Computer Science and Information Engineering, Zhejiang Gongshang University, China. His research interests are machine learning and computer vision.

Yujie Zhang is a research assistant in the School of Computer Science and Information Engineering, Zhejiang Gongshang University, China. His research interests are machine learning and pattern recognition, and he also works on image and video analysis.

Di Zhou is the President of the Uniview Research Institute and a member of the Zhejiang 151 Key Subsidized Talents program. He has been engaged in the field of intelligent IoT for 18 years, is the inventor of more than 350 authorized invention patents and 14 American patents, has led two national projects, Big Data Mining for Smart City and High Definition Intelligent Camera for Smart City, and has won the Chinese Patent Excellence Award among other awards.

Guohua Cheng is a PhD candidate at Fudan University, Shanghai, China, and received his Master's degree from Nanyang Technological University, Singapore. He is currently CEO of Jianpei Technology Co. Ltd, a member of the 1000 Talents Plan of Zhejiang Province and the 521 Program of Hangzhou, and Director of the Artificial Intelligence Committee of the China Association for Medical Device Industry. His research interests are machine learning and biomedical engineering, and he also works on artificial intelligence for medical imaging.

Wei-Gang Chen received his M.S. degree from Zhejiang Sci-Tech University, Hangzhou, China, in 1995, and his Ph.D. degree from the Department of Computer Science and Technology, Shanghai Jiaotong University, Shanghai, China, in 2004. Since 2004, he has been an Associate Professor with the School of Computer Science and Information Engineering, Zhejiang Gongshang University, Hangzhou, China. His research interests include video and image processing, pedestrian detection and counting, and video compression and communication.

Ruili Wang received his Ph.D. degree in Computer Science from Dublin City University, Dublin, Ireland. He is currently a Professor of Artificial Intelligence. His research interests include speech processing, language processing, image processing, data mining, intelligent systems, and complex systems.
