Neurocomputing

Volume 339, 28 April 2019, Pages 202-209
Two-stage transfer network for weakly supervised action localization

https://doi.org/10.1016/j.neucom.2019.02.026

Abstract

Action localization is a central yet challenging task in video analysis. Most existing methods rely heavily on supervised learning, where the action label of each frame must be given beforehand. Unfortunately, in many real applications it is costly and resource-consuming to obtain frame-level action labels for untrimmed videos. In this paper, a novel two-stage paradigm that requires only video-level action labels is proposed for weakly supervised action localization. To this end, an Image-to-Video (I2V) network is first developed to transfer the knowledge learned from the image domain (e.g., ImageNet) to the specific video domain. Relying on the model learned by the I2V network, a Video-to-Proposal (V2P) network is further designed to identify action proposals without the need for temporal annotations. Lastly, a proposal selection layer is devised on top of the V2P network to choose the maximal proposal response for each class and thus obtain a video-level prediction score. By minimizing the difference between the prediction score and the video-level label, we fine-tune the V2P network to learn an enhanced discriminative ability for classifying proposal inputs. Extensive experimental results show that our method outperforms state-of-the-art approaches on ActivityNet1.2, and the mAP is improved from 13.7% to 16.2% on THUMOS14. More importantly, even with weak supervision, our networks attain accuracy comparable to those employing strong supervision, demonstrating the effectiveness of our method.

Introduction

Action localization is a fundamental task in video understanding. Given a video, action localization should simultaneously answer "what kind of action is in this video?" and "when does it start and end?" The problem matters because long, untrimmed videos are dominant in real-world applications such as surveillance.

In recent years, temporal action localization in videos has been an active area of research, and great progress has been facilitated by a wealth of methods for learning video representations [1], [2], [3], [4], [5], [6], [7] and by large datasets [8], [9], [10]. Benefiting from the growth of parallel computing power, deep learning based methods have recently achieved great improvements in video analysis. Many of these works exploit convolutional neural networks as feature extractors and train classifiers to categorize sliding windows or segment proposals [11], [12], [13], [14], [15]. These methods rely heavily on the temporal annotations of untrimmed videos in a supervised setting: proposals must be labeled with action categories as well as start and end times. With such dense temporal annotations, the proposal-level loss can be calculated and backward propagation can be applied to train the networks. The problems, however, are how to collect action annotations and how to guarantee their quality. In particular, manually annotating actions frame by frame is not only time-consuming but also subjective to the annotators, making the annotations severely biased.

The annotation issue above also exists in research on still images. For example, in object detection it is costly to collect object bounding boxes manually, and weakly supervised object detection has therefore been studied extensively [16], [17]. For the video domain, however, it is more challenging to solve the action localization problem given video labels only, because action localization must not only learn spatial features but also extract temporal patterns. Hence, few attempts have been made at weakly supervised action localization.

This paper tackles weakly supervised action localization as a proposal classification problem without frame-level annotations. To this end, we propose a two-stage paradigm to localize actions in untrimmed videos. In the first stage, we propose an Image-to-Video (I2V) network for untrimmed video classification. This network can capture coarse action patterns from untrimmed videos, but it neglects the precise discriminative information between action categories. Hence, we further propose a Video-to-Proposal (V2P) network in the second stage. It is natural to feed proposals into the network and obtain prediction outputs. Since the loss cannot be calculated directly when proposal labels are unavailable, a dedicated proposal selection layer is designed in the V2P network. Through this layer, the proposals contributing most to each class are selected, and gradients are propagated only through these proposals. In other words, the video-level loss is transferred to a proposal-level one, and the network parameters can be updated accordingly; a sketch of this mechanism is given below.
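The selection mechanism can be summarized in a few lines. The following is only a minimal PyTorch sketch of such a layer, not the authors' implementation: the tensor shapes, the per-class max-pooling over proposals, and the binary cross-entropy loss are assumptions based on the description above.

    import torch
    import torch.nn as nn

    class ProposalSelectionLayer(nn.Module):
        """Sketch of a proposal selection layer (assumed behaviour): keep the
        maximal response of each class, so that gradients flow back only through
        the proposals that contribute most to that class."""

        def forward(self, proposal_scores: torch.Tensor) -> torch.Tensor:
            # proposal_scores: (num_proposals, num_classes) raw V2P outputs
            video_scores, _ = proposal_scores.max(dim=0)  # (num_classes,)
            return video_scores

    def video_level_loss(proposal_scores: torch.Tensor,
                         video_label: torch.Tensor) -> torch.Tensor:
        """Video-level multi-label loss; BCE is an assumption, not taken from the paper."""
        video_scores = ProposalSelectionLayer()(proposal_scores)
        return nn.functional.binary_cross_entropy_with_logits(video_scores, video_label)

    # Toy usage: 8 proposals, 20 action classes, a two-hot video-level label.
    scores = torch.randn(8, 20, requires_grad=True)
    label = torch.zeros(20)
    label[[3, 7]] = 1.0
    video_level_loss(scores, label).backward()  # only the selected proposals receive gradients

Because the max operation routes the gradient to a single proposal per class, minimizing this video-level loss effectively supervises individual proposals, which is the sense in which the video-level loss is transferred to the proposal level.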

The main contributions of our paper are summarized as follows:

  • (1)

    We devise a principled technique to tackle the problem of weakly supervised action localization. We successfully train a two-stage network to localize actions in untrimmed videos without the need for temporal annotations of action instances.

  • (2)

    We provide an efficient proposal selection layer to bridge the proposal inputs and the video labels. With the help of this layer, the video-level loss is transferred to a proposal-level one, and the network parameters can thus be fine-tuned efficiently.

  • (3)

    Our proposed network outperforms state-of-the-art methods on ActivityNet1.2 [9] and shows results comparable to the state of the art on THUMOS14 [8]. We significantly improve the current best result on ActivityNet1.2 from 16.0% to 18.5%.

Section snippets

Related work

Action localization. Action localization in videos requires not only recognizing action categories but also localizing the start and end time of each action instance. Previous works focus on classifying proposals generated by sliding windows with hand-crafted features [18], [19]. Over the past few years, deep learning based methods have been extensively studied [11], [12], [14], [15], [20], [21], [22], [23]. Shou et al. [12] proposed a multi-stage approach involving three segment-based…

Proposed method

Given an untrimmed video set $\{V_i, y_i\}_{i=1}^{N}$, where $N$ is the number of videos and $y_i \in \{0, 1\}^{c}$ is the label vector of the $i$-th video, with $c$ being the total number of action classes in the video set. Each video may belong to one or multiple classes, depending on how many types of actions it contains. Each video may also contain multiple action instances, but the position of each action, denoted by $(t_{\mathrm{start}}, t_{\mathrm{end}})$, is unknown, where $t_{\mathrm{start}}$ and $t_{\mathrm{end}}$ denote the start time point and end time…
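The snippet above breaks off before the training criterion is stated; the block below is therefore only a plausible LaTeX reconstruction of the video-level objective implied by the abstract, where the selection layer keeps the per-class maximum over proposal scores and a multi-label cross-entropy (our assumption) compares it with $y_i$:

    % s_{i,j} \in \mathbb{R}^{c}: V2P scores of the j-th proposal of video V_i
    \hat{y}_{i} = \max_{j}\, s_{i,j} \quad \text{(element-wise over the $c$ classes)},
    \qquad
    \mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{c}
    \Big[\, y_{i,k}\log\sigma(\hat{y}_{i,k}) + (1-y_{i,k})\log\big(1-\sigma(\hat{y}_{i,k})\big) \Big].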

Experiments

We evaluate the performance of the proposed method on two benchmark datasets and compare our method with several state-of-the-art methods.
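Localization performance on THUMOS14 and ActivityNet1.2 is commonly reported as mAP at temporal IoU (tIoU) thresholds. As a quick reference, the sketch below shows the standard tIoU computation between a predicted segment and a ground-truth segment; the 0.5 threshold mentioned in the comment is only an example.

    def temporal_iou(pred, gt):
        """Temporal IoU between two segments given as (start, end) in seconds."""
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
        return inter / union if union > 0 else 0.0

    # A prediction is usually counted as a true positive when its tIoU with an
    # unmatched ground-truth instance of the same class exceeds a threshold
    # (e.g. 0.5); mAP is then the mean of the per-class average precisions.
    print(temporal_iou((12.0, 20.0), (10.0, 18.0)))  # 0.6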

Conclusions

We address the weakly supervised action localization problem by developing a novel coarse-to-fine framework. Given only the video labels, our method is able to transfer knowledge learned from the video domain to the proposal domain by automatically mining positive samples for training. We achieve state-of-the-art performance on two benchmark datasets, THUMOS14 and ActivityNet 1.2. One future direction for enhancing our network is to consider more advanced feature extraction methods…

Qiubin Su received the Master of Science in Mathematics and Applied Mathematics from Sun Yat-sen University, Guangzhou, China, in 2005. He has been working at South China University of Technology since 2005 and has been studying in the School of Computer Science & Engineering, South China University of Technology, since 2013.

References (35)

  • K. Simonyan et al.

    Two-stream convolutional networks for action recognition in videos

    Advances in Neural Information Processing Systems

    (2014)
  • D. Tran et al.

    Learning spatiotemporal features with 3D convolutional networks

    Proceedings of the ICCV

    (2015)
  • C. Cao et al.

    Action recognition with joints-pooled 3D deep convolutional descriptors

    Proceedings of the IJCAI

    (2016)
  • P.T. Bilinski et al.

    Video covariance matrix logarithm for human action recognition in videos

    Proceedings of the IJCAI

    (2015)
  • L. Wang et al.

    Temporal segment networks: Towards good practices for deep action recognition

    Proceedings of the ECCV

    (2016)
  • J. Liu et al.

    SSNet: Scale selection network for online 3D action prediction

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2018)
  • X. Shu et al.

    Hierarchical long short-term concurrent memory for human interaction recognition

    (2018)
  • Y. Jiang et al.

    THUMOS challenge: Action recognition with a large number of classes

    (2014)
  • F. Caba Heilbron et al.

    ActivityNet: A large-scale video benchmark for human activity understanding

    Proceedings of the CVPR

    (2015)
  • C. Gu et al.

    AVA: A video dataset of spatio-temporally localized atomic visual actions

    (2017)
  • S. Yeung et al.

    End-to-end learning of action detection from frame glimpses in videos

    Proceedings of the CVPR

    (2016)
  • Z. Shou et al.

    Temporal action localization in untrimmed videos via multi-stage CNNs

    Proceedings of the CVPR

    (2016)
  • H. Xu et al.

    R-C3D: Region convolutional 3D network for temporal activity detection

    Proceedings of the ICCV

    (2017)
  • Z. Shou et al.

    CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos

    Proceedings of the CVPR

    (2017)
  • Y. Zhao et al.

    Temporal action detection with structured segment networks

    Proceedings of the ICCV

    (2017)
  • H. Bilen et al.

    Weakly supervised deep detection networks

    Proceedings of the CVPR

    (2016)
  • B. Lai et al.

    Saliency guided end-to-end learning for weakly supervised object detection

    Proceedings of the IJCAI

    (2017)
