Abstract
This paper strives to localize the temporal extent of an action in a long untrimmed video. Whereas existing work leverages many training examples annotated with their start time, end time, and/or action class, we propose few-shot common action localization. The start and end of an action in a long untrimmed video are determined based on just a handful of trimmed video examples containing the same action, without knowing their common class label. To address this task, we introduce a new 3D convolutional network architecture that aligns representations from the support videos with the relevant query video segments. The network contains: (i) a mutual enhancement module that simultaneously complements the representations of the few trimmed support videos and the untrimmed query video; (ii) a progressive alignment module that iteratively fuses the support videos into the query branch; and (iii) a pairwise matching module that weighs the importance of the different support videos. Evaluation of few-shot common action localization on untrimmed videos containing a single or multiple action instances demonstrates the effectiveness and general applicability of our proposal.
P. Yang and V. T. Hu—Equal contribution.
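To make the three modules concrete, below is a minimal PyTorch-style sketch of the support-to-query pipeline the abstract describes. Everything here is an illustrative assumption: the class names, the pooled 1D feature shapes (the paper operates on 3D convolutional features), and the attention-based fusion are ours, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseMatching(nn.Module):
    """Weigh each trimmed support video by its similarity to the query.

    Hypothetical sketch: cosine similarity between the pooled query
    feature and each pooled support feature, softmax-normalized.
    """
    def forward(self, query, supports):
        # query:    (T, D) temporal features of the untrimmed query video
        # supports: (K, D) one pooled feature per trimmed support video
        q = F.normalize(query.mean(dim=0, keepdim=True), dim=-1)  # (1, D)
        s = F.normalize(supports, dim=-1)                         # (K, D)
        return torch.softmax(q @ s.t(), dim=-1).squeeze(0)        # (K,)

class ProgressiveAlignment(nn.Module):
    """Iteratively fuse the weighted support features into the query branch.

    Our stand-in for the paper's progressive alignment: repeated
    cross-attention with a residual connection. The mutual enhancement
    step that would precede it is omitted for brevity.
    """
    def __init__(self, dim, steps=2, heads=4):
        super().__init__()
        self.steps = steps
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query, supports, weights):
        sup = (weights.unsqueeze(-1) * supports).unsqueeze(0)  # (1, K, D)
        q = query.unsqueeze(0)                                 # (1, T, D)
        for _ in range(self.steps):
            fused, _ = self.attn(q, sup, sup)  # query attends to supports
            q = q + fused                      # residual fusion per step
        return q.squeeze(0)                    # (T, D), support-conditioned

# Usage on random features standing in for 3D-CNN outputs.
T, K, D = 128, 5, 256
query, supports = torch.randn(T, D), torch.randn(K, D)
weights = PairwiseMatching()(query, supports)                # (K,)
aligned = ProgressiveAlignment(D)(query, supports, weights)  # (T, D)
```

In such a setup, the support-conditioned query features would then feed a localization head that regresses the start and end of the common action.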
Cite this paper
Yang, P., Hu, V.T., Mettes, P., Snoek, C.G.M. (2020). Localizing the Common Action Among a Few Videos. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12352. Springer, Cham. https://doi.org/10.1007/978-3-030-58571-6_30