Triple attention network for video segmentation
Introduction
Video segmentation is a fundamental and challenging problem that aims to separate foreground pixels from background pixels in every frame of a given video. It has been an active area of computer vision research in recent years, and its potential applications include video editing [1], medical diagnosis [2], and autonomous driving [3].
Recently, owing to developments in deep learning, image segmentation based on multiscale analysis [4] and synthesis-based data augmentation [5] has achieved acceptable results. Context in the spatial, temporal, and channel domains is an important factor in enhancing the effectiveness of existing approaches. Examples of relationships in these domains on the DAVIS16 dataset [6] are illustrated in Fig. 1. The top and middle rows show that many highly related regions (marked by dotted boxes of the same color) exist across a temporal sequence or within a single image, and these temporal and spatial contexts make inference more robust. The bottom row illustrates feature maps in different channels. The high-value regions (in red dotted boxes) in different channels correspond to different parts of the object, for instance, the foot and the head of a person, and the pairwise relations between these parts provide additional semantic cues for refining the segmentation result. However, simultaneously capturing long-range dependencies in the spatial, temporal, and channel spaces remains an open issue in video segmentation.
To model relations within a single domain, nonlocal neural networks [7] learn long-range dependencies in the spatial domain by using the affinity between pixels. What is needed is an approach that flexibly extends this mechanism to other spaces, together with a method that properly combines context features from multiple spaces to enhance discrimination in pixelwise classification tasks such as video segmentation.
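The nonlocal operation can be sketched in plain NumPy: every spatial position attends to every other position through softmax-normalized affinities, and the weighted context is added back residually. This is a minimal illustration of the mechanism only; the learned embedding projections of [7] are omitted.

```python
import numpy as np

def spatial_self_attention(x):
    """Non-local (self-attention) update over spatial positions.

    x: feature map of shape (C, H, W). Each of the N = H*W positions is
    compared with every other position, so long-range dependencies are
    captured regardless of spatial distance.
    """
    c, h, w = x.shape
    feats = x.reshape(c, h * w)              # (C, N)
    affinity = feats.T @ feats               # (N, N) pairwise affinities
    # Softmax over each row turns affinities into attention weights.
    affinity -= affinity.max(axis=1, keepdims=True)
    weights = np.exp(affinity)
    weights /= weights.sum(axis=1, keepdims=True)
    context = feats @ weights.T              # (C, N) attention-weighted context
    # Residual connection keeps the original features intact.
    return (feats + context).reshape(c, h, w)

np.random.seed(0)
out = spatial_self_attention(np.random.randn(4, 8, 8))
print(out.shape)  # (4, 8, 8)
```

The output has the same shape as the input, so the operation can be dropped into a network between any two convolutional layers.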
In this paper, we present a novel framework, called the triple attention network (TriANet), which is illustrated in Fig. 2. The temporal attention map is learned from the representations of past frames and the current frame, capturing temporal dependencies between memory information and the current observation. The channel and spatial attention maps are obtained from the current image alone, as the spatial and channel dependencies are dynamic and independent of historical information. The feature maps in each domain are then updated into context features that carry sufficient semantic information for video segmentation.
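The three branches described above can be sketched as follows, assuming dot-product affinities in every domain and element-wise summation as the fusion rule; the function names and the fusion choice are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(x):
    # x: (C, N) flattened current-frame features; attention between channels.
    attn = softmax(x @ x.T, axis=1)              # (C, C)
    return x + attn @ x

def spatial_attention(x):
    # Attention between spatial positions of the current frame.
    attn = softmax(x.T @ x, axis=1)              # (N, N)
    return x + x @ attn.T

def temporal_attention(current, memory):
    # current: (C, N); memory: (C, M) features pooled from past frames.
    attn = softmax(current.T @ memory, axis=1)   # (N, M)
    return current + memory @ attn.T             # (C, N)

def triple_attention(current, memory):
    # Fuse the three context features by element-wise summation.
    return (spatial_attention(current) + channel_attention(current)
            + temporal_attention(current, memory))

np.random.seed(0)
cur = np.random.randn(8, 16)   # current-frame features, (C, N)
mem = np.random.randn(8, 12)   # past-frame memory features, (C, M)
fused = triple_attention(cur, mem)
print(fused.shape)  # (8, 16)
```

Note that only the temporal branch consumes memory features, matching the description above: the spatial and channel maps are computed from the current image alone.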
The contributions of this paper are as follows:
- We present a new triple attention network that uses a self-attention mechanism to enhance the discriminative ability of feature maps for video segmentation.
- We simultaneously exploit temporal, spatial, and channel context knowledge with relatively lightweight networks to improve the segmentation results.
- Experimental results on the Shining3D dental, DAVIS16, and DAVIS17 datasets show that our method yields satisfactory results compared with state-of-the-art video segmentation methods.
Related work
In this section, we briefly review the literature on context exploitation in video segmentation.
Spatial context exploitation. Graph models, such as the spatiotemporal Markov random field (STMRF) [8] and VideoGCRF [9], encode spatial dependencies in a deep learning framework. However, these approaches, like sample relation exploitation [10], [11], are time consuming at inference and sensitive to changes in visual appearance. Therefore, adaptive affinity fields (AAF) [12] were
Our approach
Because convolution operates on a local receptive field, features in different regions that correspond to the same class may be inconsistent. This intraclass inconsistency degrades segmentation accuracy. To address this problem, we explore global contextual knowledge by building relations among features with an attention mechanism that learns long-range contextual knowledge in the channel, spatial, and temporal dimensions. Our method is shown in Fig. 2. We use a
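The effect on intraclass inconsistency can be illustrated with a toy example: after a nonlocal (self-attention) update, features belonging to the same class become more mutually similar, because each position aggregates context mainly from the positions with which it has high affinity. This is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def nonlocal_update(x):
    """Self-attention over positions; x has shape (C, N)."""
    aff = x.T @ x                            # (N, N) affinities
    aff -= aff.max(axis=1, keepdims=True)
    w = np.exp(aff)
    w /= w.sum(axis=1, keepdims=True)
    return x + x @ w.T                       # residual + weighted context

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Columns are feature vectors: the first two belong to one class,
# the last two to another; within-class vectors differ by noise.
x = np.array([[10.0, 10.0,  0.0,  0.0],
              [ 0.0,  0.0, 10.0, 10.0],
              [ 1.0, -1.0,  1.0, -1.0]])
before = cosine(x[:, 0], x[:, 1])
y = nonlocal_update(x)
after = cosine(y[:, 0], y[:, 1])
print(before < after)  # True: same-class features grow more similar
```

Each same-class position pulls in context close to its class mean, which averages out the noise term, so the within-class cosine similarity increases after the update.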
Experimental results
In this section, we compare the effectiveness of our proposed method with that of other methods.
Conclusion
The experimental results demonstrate that attention is an effective mechanism for video segmentation and can be employed simultaneously in the spatial, temporal, and channel domains. Specifically, we propose a method that uses self-attention to infer and combine context from different aspects of a video sequence and to acquire representative and informative context features. Although our method may produce inexact segmentation due in part to factors such as shadows, it is robust to
CRediT authorship contribution statement
Yan Tian: Methodology, Writing - original draft, Writing - review & editing, Funding acquisition. Yujie Zhang: Software, Data curation, Visualization. Di Zhou: Investigation, Resources, Writing - review & editing, Funding acquisition. Guohua Cheng: Resources, Writing - review & editing, Funding acquisition. Wei-Gang Chen: Validation, Formal analysis, Project administration. Ruili Wang: Conceptualization, Writing - review & editing, Supervision, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work was supported in part by the National Natural Science Foundation of China under Grant 61972351, Grant 61672460, and Grant 61702453, in part by the Natural Science Foundation of Zhejiang Province under Grant LY19F030005, Grant LY18F020008, Grant LQ17F030001, and Grant LQ20F020008, in part by the Science and Technology Program of Zhejiang Province under Grant 2020C01049, in part by the Opening Foundation of State Key Laboratory of Virtual Reality Technology and System of Beihang
Yan Tian received the PhD degree from Beijing University of Posts and Telecommunications, Beijing, China, in 2011. He held a postdoctoral research fellow position from 2012 to 2015 at Zhejiang University, Hangzhou, China. He is currently an Associate Professor in the School of Computer Science and Information Engineering, Zhejiang Gongshang University, China. His research interests are machine learning and computer vision.
References (51)
- et al., "Joint temporal context exploitation and active learning for video segmentation," PR, 2019.
- et al., "Inferring salient objects from human fixations," TPAMI, 2019.
- et al., "A deep network solution for attention and aesthetics aware photo cropping," TPAMI, 2018.
- et al., "Traffic sign detection using a multi-scale recurrent attention network," TITS, 2019.
- et al., "Encoder-decoder with atrous separable convolution for semantic image segmentation," ECCV, 2018.
- et al., "Improving semantic segmentation via video propagation and label relaxation," CVPR, 2019.
- et al., "A benchmark dataset and evaluation methodology for video object segmentation," CVPR, 2016.
- et al., "Non-local neural networks," CVPR, 2018.
- et al., "CNN in MRF: Video object segmentation via inference in a CNN-based higher-order spatio-temporal MRF," CVPR, 2018.
- et al., "Deep spatio-temporal random fields for efficient video segmentation," CVPR, 2018.
- "Quadruplet network with one-shot learning for fast visual object tracking," TIP.
- "Triplet loss in Siamese network for object tracking," ECCV.
- "Adaptive affinity fields for semantic segmentation," ECCV.
- "Learning affinity via spatial propagation networks," NeurIPS.
- "RelationNet: Learning deep-aligned representation for semantic image segmentation," ICPR.
- "DifNet: Semantic segmentation by diffusion networks," NeurIPS.
- "Semantic instance meets salient object: Study on video semantic salient instance segmentation," WACV.
- "Weakly supervised learning of instance segmentation with inter-pixel relations," CVPR.
- "Local semantic Siamese networks for fast tracking," TIP.
- "Learning channel-wise interactions for binary convolutional neural networks," CVPR.
- "Residual attention network for image classification," CVPR.
- "EleAtt-RNN: Adding attentiveness to neurons in recurrent neural networks," TPAMI.
- "SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning," CVPR.
- "Dual attention network for scene segmentation," CVPR.
- "Video object segmentation by learning location-sensitive embeddings," ECCV.
Yujie Zhang is a research assistant in the School of Computer Science and Information Engineering, Zhejiang Gongshang University, China. His research interests are machine learning and pattern recognition, and he also works on image and video analysis.
Di Zhou is the President of the Uniview Research Institute and a member of Zhejiang's 151 Key Subsidized Talents program. He has worked in the field of intelligent IoT for 18 years and is the inventor of more than 350 authorized invention patents and 14 American patents. He led two national projects, Big Data Mining for Smart City and High Definition Intelligent Camera for Smart City, and won the Chinese Patent Excellence Award among other awards.
Guohua Cheng is a PhD candidate at Fudan University, Shanghai, China, and received his Master's degree from Nanyang Technological University, Singapore. He is currently the CEO of Jianpei Technology Co., Ltd., a member of the 1000 Talents Plan of Zhejiang Province and of the 521 Program of Hangzhou, and Director of the Artificial Intelligence Committee of the China Association for Medical Device Industry. His research interests are machine learning and biomedical engineering, and he also works on artificial intelligence for medical images.
Wei-Gang Chen received the M.S. degree from Zhejiang Sci-Tech University, Hangzhou, China, in 1995, and the Ph.D. degree from the Department of Computer Science and Technology, Shanghai Jiaotong University, Shanghai, China, in 2004. Since 2004, he has been an Associate Professor with the School of Computer Science and Information Engineering, Zhejiang Gongshang University, Hangzhou, China. His research interests include video and image processing, pedestrian detection and counting, and video compression and communication.
Ruili Wang received the Ph.D. degree in Computer Science from Dublin City University, Dublin, Ireland. He is currently a Professor of Artificial Intelligence. His research interests include speech processing, language processing, image processing, data mining, intelligent systems, and complex systems.