Two-stage transfer network for weakly supervised action localization
Introduction
Action localization is a fundamental task for video understanding. Given a video, action localization should simultaneously answer "what kind of action is in this video?" and "when does it start and end?" This problem is important because long, untrimmed videos dominate real-world applications such as surveillance.
In recent years, temporal action localization in videos has been an active area of research, and great progress has been facilitated by abundant methods for learning video representations [1], [2], [3], [4], [5], [6], [7] and by large-scale datasets [8], [9], [10]. Benefiting from the development of parallel computing power, deep learning based methods have recently achieved great improvements in video analysis. Many of these works exploit convolutional neural networks as feature extractors and train classifiers to categorize sliding windows or segment proposals [11], [12], [13], [14], [15]. These methods rely heavily on temporal annotations of the untrimmed videos in a supervised setting: each proposal must be labeled with its action category, start time, and end time. With such dense temporal annotations, the proposal-level loss can be calculated and backpropagation can be applied to train the networks. The problems, however, are how to collect action annotations and how to guarantee their quality. In particular, manually annotating actions frame by frame is not only time-consuming but also subject to the annotators' judgment, making the annotations severely biased.
The annotation issue above also exists in research on still images. For example, in object detection, it is costly to manually collect object bounding boxes. As a result, weakly supervised object detection has been extensively studied [16], [17]. In the video domain, however, it is more challenging to solve the action localization problem given only video labels, because action localization requires learning not only spatial features but also temporal patterns. Hence, very few attempts have been made at weakly supervised action localization.
This paper tackles weakly supervised action localization as a proposal classification problem without frame-level annotations. To this end, we propose a two-stage paradigm to localize actions in untrimmed videos. In the first stage, we propose an Image-to-Video (I2V) network for untrimmed video classification. This network captures coarse action patterns from untrimmed videos but neglects the fine-grained discriminative information between action categories. Hence, in the second stage we further propose a Video-to-Proposal (V2P) network. It is natural to feed proposals into the network and obtain prediction outputs; however, the loss cannot be calculated directly since proposal labels are unavailable. We therefore design a dedicated proposal selection layer in the V2P network. Through this layer, the proposals contributing most to each class are selected, and gradients are propagated via these proposals. In other words, the video-level loss is transferred to the proposal-level one, and the network parameters can be updated accordingly.
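To make this transfer concrete, below is a minimal PyTorch sketch of such a selection layer, under our own assumptions: the top-k aggregation scheme, the value of k, and the binary cross-entropy video-level loss are illustrative choices rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class ProposalSelectionLayer(nn.Module):
    """Aggregates per-proposal class scores into video-level scores by
    keeping, for each class, only the k highest-scoring proposals."""

    def __init__(self, k: int = 8):
        super().__init__()
        self.k = k  # proposals kept per class (hypothetical hyper-parameter)

    def forward(self, proposal_scores: torch.Tensor) -> torch.Tensor:
        # proposal_scores: (num_proposals, c) raw class scores for one video
        k = min(self.k, proposal_scores.size(0))
        topk_scores, _ = proposal_scores.topk(k, dim=0)  # (k, c)
        # Gradients of the video-level loss flow back only through the
        # selected top-k proposals of each class.
        return topk_scores.mean(dim=0)  # video-level logits, shape (c,)

# Usage: transfer the video-level loss to the proposal level.
proposal_scores = torch.randn(100, 20, requires_grad=True)  # 100 proposals, c = 20
video_logits = ProposalSelectionLayer(k=8)(proposal_scores)
video_labels = torch.zeros(20)
video_labels[3] = 1.0  # video-level label vector y (multi-label, one class here)
loss = nn.BCEWithLogitsLoss()(video_logits, video_labels)
loss.backward()  # only the selected proposals receive non-zero gradients
```

Because only the top-k entries per class enter the pooled video-level score, backpropagation updates the network exactly through those selected proposals, which is how video-level supervision is carried down to the proposal level.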
The main contributions of our paper are summarized as follows:
- (1) We devise a principled technique to tackle the problem of weakly supervised action localization. We successfully train a two-stage network to localize actions in untrimmed videos without the need for temporal annotations of action instances.
- (2) We provide an efficient proposal selection layer that bridges the proposal inputs and the video labels. With the help of this layer, the video-level loss is transferred to the proposal-level one, and thus the network parameters can be fine-tuned efficiently.
- (3) Our proposed network outperforms state-of-the-art methods on ActivityNet 1.2 [9] and achieves results comparable to the state of the art on THUMOS14 [8]. In particular, we improve the current best result on ActivityNet 1.2 from 16.0% to 18.5%.
Related work
Action localization. Action localization in videos requires not only recognizing action categories, but also localizing the start and end time of each action instance. Previous works focused on classifying proposals generated by sliding windows with hand-crafted features [18], [19]. Over the past few years, deep learning based methods have been extensively studied [11], [12], [14], [15], [20], [21], [22], [23]. Shou et al. [12] proposed a multi-stage approach involving three segment-based 3D convolutional networks for proposal generation, classification, and localization.
Proposed method
Given an untrimmed video set {(v_i, y_i)}_{i=1}^N, where N is the number of videos and y_i ∈ {0, 1}^c is the label vector of the i-th video, with c being the total number of action classes in the video set. Each video may belong to one or multiple classes, depending on how many types of actions it contains. Moreover, each video may contain multiple action instances, but the position of each instance, denoted by (t_start, t_end), is unknown, where t_start and t_end denote the start and end time points of the instance, respectively.
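For concreteness, under this notation a standard video-level multi-label objective is the binary cross-entropy over all classes (a hedged reconstruction from the definitions above; the paper's exact loss may differ):

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{c}\Big[\, y_{ij}\log \hat{y}_{ij} + (1-y_{ij})\log\big(1-\hat{y}_{ij}\big) \Big],$$

where ŷ_ij ∈ (0, 1) denotes the predicted probability that video v_i contains an action of class j.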
Experiments
We evaluate the performance of the proposed method on two benchmark datasets and compare it with several state-of-the-art methods.
Conclusions
We address the weakly supervised action localization problem by developing a novel coarse-to-fine framework. Given only video-level labels, our method transfers knowledge learned in the video domain to the proposal domain by automatically mining positive samples for training. We achieved state-of-the-art performance on two benchmark datasets, THUMOS14 and ActivityNet 1.2. One future direction to enhance our network is to consider more advanced feature extraction methods.
Qiubin Su received the Master of Science in Mathematics and Applied Mathematics from Sun Yat-sen University, Guangzhou, China, in 2005. He has been working at South China University of Technology since 2005 and has been studying in the School of Computer Science & Engineering, South China University of Technology since 2013.
References (35)
- et al., Two-stream convolutional networks for action recognition in videos, Advances in Neural Information Processing Systems, 2014.
- et al., Learning spatiotemporal features with 3D convolutional networks, Proceedings of the ICCV, 2015.
- et al., Action recognition with joints-pooled 3D deep convolutional descriptors, Proceedings of the IJCAI, 2016.
- et al., Video covariance matrix logarithm for human action recognition in videos, Proceedings of the IJCAI, 2015.
- et al., Temporal segment networks: towards good practices for deep action recognition, Proceedings of the ECCV, 2016.
- et al., SSNet: scale selection network for online 3D action prediction, Proceedings of the CVPR, 2018.
- et al., Hierarchical long short-term concurrent memory for human interaction recognition, 2018.
- et al., THUMOS challenge: action recognition with a large number of classes, 2014.
- et al., ActivityNet: a large-scale video benchmark for human activity understanding, Proceedings of the CVPR, 2015.
- et al., AVA: a video dataset of spatio-temporally localized atomic visual actions, 2017.
- End-to-end learning of action detection from frame glimpses in videos, Proceedings of the CVPR.
- Temporal action localization in untrimmed videos via multi-stage CNNs, Proceedings of the CVPR.
- R-C3D: region convolutional 3D network for temporal activity detection, Proceedings of the ICCV.
- CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos, Proceedings of the CVPR.
- Temporal action detection with structured segment networks, Proceedings of the ICCV.
- Weakly supervised deep detection networks, Proceedings of the CVPR.
- Saliency guided end-to-end learning for weakly supervised object detection, Proceedings of the IJCAI.