Neurocomputing

Volume 339, 28 April 2019, Pages 202-209
Two-stage transfer network for weakly supervised action localization

https://doi.org/10.1016/j.neucom.2019.02.026

Abstract

Action localization is a central yet challenging task in video analysis. Most existing methods rely heavily on supervised learning, where the action label of each frame must be given beforehand. Unfortunately, in many real applications it is costly and resource-consuming to obtain frame-level action labels for untrimmed videos. In this paper, a novel two-stage paradigm that requires only video-level action labels is proposed for weakly supervised action localization. To this end, an Image-to-Video (I2V) network is first developed to transfer the knowledge learned from the image domain (e.g., ImageNet) to the specific video domain. Relying on the model learned by the I2V network, a Video-to-Proposal (V2P) network is further designed to identify action proposals without the need for temporal annotations. Lastly, a proposal selection layer is devised on top of the V2P network to choose the maximal proposal response for each class and thus obtain a video-level prediction score. By minimizing the difference between the prediction score and the video-level label, we fine-tune the V2P network to learn an enhanced discriminative ability for classifying proposal inputs. Extensive experimental results show that our method outperforms state-of-the-art approaches on ActivityNet1.2, and the mAP is improved from 13.7% to 16.2% on THUMOS14. More importantly, even with weak supervision, our networks attain accuracy comparable to those employing strong supervision, demonstrating the effectiveness of our method.

Introduction

Action localization is a fundamental task in video understanding. Given a video, action localization should simultaneously answer "what kind of action is in this video?" and "when does it start and end?" The problem matters because long, untrimmed videos are dominant in real-world applications such as surveillance.

In recent years, temporal action localization in videos has been an active area of research, and great progress has been facilitated by a wealth of methods for learning video representations [1], [2], [3], [4], [5], [6], [7] and by large datasets [8], [9], [10]. Benefiting from the growth of parallel computing power, deep learning based methods have recently achieved great improvements in video analysis. Many of these works exploit convolutional neural networks as feature extractors and train classifiers to categorize sliding windows or segment proposals [11], [12], [13], [14], [15]. These methods rely heavily on the temporal annotations of untrimmed videos in a supervised setting: proposals must be labeled with action categories as well as start and end times. With such dense temporal annotations, the proposal-level loss can be calculated and backward propagation can be applied to train the networks. The problems, however, are how to collect action annotations and how to guarantee their quality. In particular, manually annotating actions frame by frame is not only time-consuming but also subjective to the annotators, making the annotations severely biased.

The annotation issue above also exists in research on still images. For example, in object detection it is costly to collect object bounding boxes manually, and weakly supervised object detection has therefore been studied extensively [16], [17]. For the video domain, however, it is more challenging to solve the action localization problem given video labels only, because action localization must not only learn spatial features but also extract temporal patterns. Hence, few attempts have been made at weakly supervised action localization.

This paper tackles weakly supervised action localization as a proposal classification problem without frame-level annotations. To this end, we propose a two-stage paradigm to localize actions in untrimmed videos. In the first stage, we propose an Image-to-Video (I2V) network for untrimmed video classification. This network can capture coarse action patterns from untrimmed videos, but it neglects the precise discriminative information between action categories. Hence, we further propose a Video-to-Proposal (V2P) network in the second stage. It is natural to feed proposals into the network and obtain prediction outputs. Since the loss cannot be calculated directly when proposal labels are unavailable, a dedicated proposal selection layer is designed in the V2P network. Through this layer, the proposals contributing most to each class are selected, and gradients are propagated only through these proposals. In other words, the video-level loss is transferred to a proposal-level one, and the network parameters can be updated accordingly; a sketch of this mechanism is given below.
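The selection mechanism can be summarized in a few lines. The following is only a minimal PyTorch sketch of such a layer, not the authors' implementation: the tensor shapes, the per-class max-pooling over proposals, and the binary cross-entropy loss are assumptions based on the description above.

    import torch
    import torch.nn as nn

    class ProposalSelectionLayer(nn.Module):
        """Sketch of a proposal selection layer (assumed behaviour): keep the
        maximal response of each class, so that gradients flow back only through
        the proposals that contribute most to that class."""

        def forward(self, proposal_scores: torch.Tensor) -> torch.Tensor:
            # proposal_scores: (num_proposals, num_classes) raw V2P outputs
            video_scores, _ = proposal_scores.max(dim=0)  # (num_classes,)
            return video_scores

    def video_level_loss(proposal_scores: torch.Tensor,
                         video_label: torch.Tensor) -> torch.Tensor:
        """Video-level multi-label loss; BCE is an assumption, not taken from the paper."""
        video_scores = ProposalSelectionLayer()(proposal_scores)
        return nn.functional.binary_cross_entropy_with_logits(video_scores, video_label)

    # Toy usage: 8 proposals, 20 action classes, a two-hot video-level label.
    scores = torch.randn(8, 20, requires_grad=True)
    label = torch.zeros(20)
    label[[3, 7]] = 1.0
    video_level_loss(scores, label).backward()  # only the selected proposals receive gradients

Because the max operation routes the gradient to a single proposal per class, minimizing this video-level loss effectively supervises individual proposals, which is the sense in which the video-level loss is transferred to the proposal level.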

The main contributions of our paper are summarized as follows:

  • (1)

    We devise a principled technique to tackle the problem of weakly supervised action localization. We successfully train a two-stage network to localize actions in untrimmed videos without the need for temporal annotations of action instances.

  • (2)

    We provide an efficient proposal selection layer to bridge the proposal inputs and the video labels. With the help of this layer, the video-level loss is transferred to a proposal-level one, and the network parameters can thus be fine-tuned efficiently.

  • (3)

    Our proposed network outperforms state-of-the-art methods on ActivityNet1.2 [9] and shows results comparable to the state of the art on THUMOS14 [8]. We significantly improve the current best result on ActivityNet1.2 from 16.0% to 18.5%.

Section snippets

Related work

Action localization. Action localization in videos requires not only recognizing action categories but also localizing the start and end time of each action instance. Previous works focus on classifying proposals generated by sliding windows with hand-crafted features [18], [19]. Over the past few years, deep learning based methods have been extensively studied [11], [12], [14], [15], [20], [21], [22], [23]. Shou et al. [12] proposed a multi-stage approach involving three segment-based…

Proposed method

Given an untrimmed video set $\{V_i, y_i\}_{i=1}^{N}$, where $N$ is the number of videos and $y_i \in \{0, 1\}^{c}$ is the label vector of the $i$-th video, with $c$ being the total number of action classes in the video set. Each video may belong to one or multiple classes, depending on how many types of actions it contains. Each video may also contain multiple action instances, but the position of each action, denoted by $(t_{\mathrm{start}}, t_{\mathrm{end}})$, is unknown, where $t_{\mathrm{start}}$ and $t_{\mathrm{end}}$ denote the start time point and end time…
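The snippet above breaks off before the training criterion is stated; the block below is therefore only a plausible LaTeX reconstruction of the video-level objective implied by the abstract, where the selection layer keeps the per-class maximum over proposal scores and a multi-label cross-entropy (our assumption) compares it with $y_i$:

    % s_{i,j} \in \mathbb{R}^{c}: V2P scores of the j-th proposal of video V_i
    \hat{y}_{i} = \max_{j}\, s_{i,j} \quad \text{(element-wise over the $c$ classes)},
    \qquad
    \mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{c}
    \Big[\, y_{i,k}\log\sigma(\hat{y}_{i,k}) + (1-y_{i,k})\log\big(1-\sigma(\hat{y}_{i,k})\big) \Big].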

Experiments

We evaluate the performance of the proposed method on two benchmark datasets and compare our method with several state-of-the-art methods.
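Localization performance on THUMOS14 and ActivityNet1.2 is commonly reported as mAP at temporal IoU (tIoU) thresholds. As a quick reference, the sketch below shows the standard tIoU computation between a predicted segment and a ground-truth segment; the 0.5 threshold mentioned in the comment is only an example.

    def temporal_iou(pred, gt):
        """Temporal IoU between two segments given as (start, end) in seconds."""
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
        return inter / union if union > 0 else 0.0

    # A prediction is usually counted as a true positive when its tIoU with an
    # unmatched ground-truth instance of the same class exceeds a threshold
    # (e.g. 0.5); mAP is then the mean of the per-class average precisions.
    print(temporal_iou((12.0, 20.0), (10.0, 18.0)))  # 0.6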

Conclusions

We address the weakly supervised action localization problem by developing a novel coarse-to-fine framework. Given only the video labels, our method is able to transfer knowledge learned from the video domain to the proposal domain by automatically mining positive samples for training. We achieve state-of-the-art performance on two benchmark datasets, THUMOS14 and ActivityNet 1.2. One future direction for enhancing our network is to consider more advanced feature extraction methods…

Qiubin Su received the Master of Science in Mathematics and Applied Mathematics from Sun Yat-sen University, Guangzhou, China, in 2005. He has been working at South China University of Technology since 2005 and has been studying in the School of Computer Science & Engineering, South China University of Technology, since 2013.

References (35)

  • K. Simonyan et al.

    Two-stream convolutional networks for action recognition in videos

    Advances in Neural Information Processing Systems

    (2014)
  • D. Tran et al.

    Learning spatiotemporal features with 3D convolutional networks

    Proceedings of the ICCV

    (2015)
  • C. Cao et al.

    Action recognition with joints-pooled 3D deep convolutional descriptors

    Proceedings of the IJCAI

    (2016)
  • P.T. Bilinski et al.

    Video covariance matrix logarithm for human action recognition in videos

    Proceedings of the IJCAI

    (2015)
  • L. Wang et al.

    Temporal segment networks: Towards good practices for deep action recognition

    Proceedings of the ECCV

    (2016)
  • J. Liu et al.

    SSNet: Scale selection network for online 3D action prediction

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2018)
  • X. Shu et al.

    Hierarchical long short-term concurrent memory for human interaction recognition

    (2018)
  • Y. Jiang et al.

    THUMOS challenge: Action recognition with a large number of classes

    (2014)
  • F. Caba Heilbron et al.

    ActivityNet: A large-scale video benchmark for human activity understanding

    Proceedings of the CVPR

    (2015)
  • C. Gu et al.

    AVA: A video dataset of spatio-temporally localized atomic visual actions

    (2017)
  • S. Yeung et al.

    End-to-end learning of action detection from frame glimpses in videos

    Proceedings of the CVPR

    (2016)
  • Z. Shou et al.

    Temporal action localization in untrimmed videos via multi-stage CNNs

    Proceedings of the CVPR

    (2016)
  • H. Xu et al.

    R-C3D: Region convolutional 3D network for temporal activity detection

    Proceedings of the ICCV

    (2017)
  • Z. Shou et al.

    CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos

    Proceedings of the CVPR

    (2017)
  • Y. Zhao et al.

    Temporal action detection with structured segment networks

    Proceedings of the ICCV

    (2017)
  • H. Bilen et al.

    Weakly supervised deep detection networks

    Proceedings of the CVPR

    (2016)
  • B. Lai et al.

    Saliency guided end-to-end learning for weakly supervised object detection

    Proceedings of the IJCAI

    (2017)
