DOI: 10.1145/3343031.3351015

Embodied One-Shot Video Recognition: Learning from Actions of a Virtual Embodied Agent

Published: 15 October 2019

Abstract

One-shot learning aims to recognize novel target classes from only a few examples by transferring knowledge from source classes, under the general assumption that the source and target classes are semantically related but not identical. Building on this assumption, recent work has focused on image-based one-shot learning, while little work has addressed video-based one-shot learning. One challenge is that the disjoint-class assumption is difficult to maintain for videos, since clips of target classes may appear within the videos of source classes. To address this issue, we introduce a novel setting, termed embodied-agent-based one-shot learning, which leverages synthetic videos produced in a virtual environment to understand realistic videos of target classes. Within this setting, we further propose two learning tasks: embodied one-shot video domain adaptation and embodied one-shot video transfer recognition. These tasks serve as a testbed for evaluating video-related one-shot learning methods. In addition, we propose a general video segment augmentation method that significantly facilitates a variety of one-shot learning tasks. Experimental results validate the soundness of our setting and learning tasks, and show that our augmentation approach is effective for video recognition in the small-sample-size regime.
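The abstract mentions a general video segment augmentation method but does not spell out the procedure here. Purely as a hedged illustration of what segment-level augmentation for small-sample video recognition can look like, the NumPy sketch below recombines temporal segments drawn from same-class clips; the recombination strategy, function names, and parameters are assumptions for illustration, not the authors' published algorithm.

```python
# Hypothetical sketch of segment-level video augmentation for small-sample
# training. NOT the paper's published method; the recombination strategy,
# function names, and parameters are illustrative assumptions only.
import numpy as np

def split_segments(video, num_segments=4):
    """Split a video array of shape (T, H, W, C) into temporal segments."""
    return np.array_split(video, num_segments, axis=0)

def recombine_segments(videos, num_segments=4, rng=None):
    """Build a new clip by drawing each temporal segment (in order) from a
    randomly chosen video of the same action class, so the augmented clip
    stays temporally coherent while mixing appearance across examples."""
    rng = rng or np.random.default_rng()
    pieces = []
    for i in range(num_segments):
        donor = videos[rng.integers(len(videos))]  # pick a same-class clip
        pieces.append(split_segments(donor, num_segments)[i])
    return np.concatenate(pieces, axis=0)

# Usage: two 16-frame clips of one action class yield extra training clips.
rng = np.random.default_rng(0)
clips = [rng.random((16, 112, 112, 3), dtype=np.float32) for _ in range(2)]
augmented = recombine_segments(clips, rng=rng)
print(augmented.shape)  # (16, 112, 112, 3)
```

Under these assumptions, each synthetic clip preserves the rough temporal structure of the action (segment i always comes from position i of some donor), which is what makes segment-level mixing plausible for video, where naive frame shuffling would destroy motion cues.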




Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 October 2019


Author Tags

  1. embodied agents
  2. one-shot learning
  3. video action recognition

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • National Key Research and Development Program of China

Conference

MM '19

Acceptance Rates

MM '19 Paper Acceptance Rate: 252 of 936 submissions, 27%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (last 12 months): 23
  • Downloads (last 6 weeks): 3
Reflects downloads up to 27 Feb 2025


Cited By

  • (2024) Unified View Empirical Study for Large Pretrained Model on Cross-Domain Few-Shot Learning. ACM Transactions on Multimedia Computing, Communications, and Applications 20(9), 1-18. DOI: 10.1145/3673231. Online publication date: 19-Jun-2024.
  • (2024) Spatiotemporal Orthogonal Projection Capsule Network for Incremental Few-Shot Action Recognition. IEEE Transactions on Multimedia 26, 9825-9838. DOI: 10.1109/TMM.2024.3399453. Online publication date: 13-May-2024.
  • (2024) Cross-Modal Contrastive Learning Network for Few-Shot Action Recognition. IEEE Transactions on Image Processing 33, 1257-1271. DOI: 10.1109/TIP.2024.3354104. Online publication date: 2024.
  • (2024) SFF-DA: Spatiotemporal Feature Fusion for Nonintrusively Detecting Anxiety. IEEE Transactions on Instrumentation and Measurement 73, 1-13. DOI: 10.1109/TIM.2023.3341132. Online publication date: 2024.
  • (2024) Few-Shot Action Recognition via Multi-View Representation Learning. IEEE Transactions on Circuits and Systems for Video Technology 34(9), 8522-8535. DOI: 10.1109/TCSVT.2024.3384875. Online publication date: Sep-2024.
  • (2024) Advances in Few-Shot Action Recognition: A Comprehensive Review. 2024 7th International Conference on Artificial Intelligence and Big Data (ICAIBD), 390-398. DOI: 10.1109/ICAIBD62003.2024.10604585. Online publication date: 24-May-2024.
  • (2024) Harnessing Meta-Learning for Improving Full-Frame Video Stabilization. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12605-12614. DOI: 10.1109/CVPR52733.2024.01198. Online publication date: 16-Jun-2024.
  • (2024) Transfer learning and its extensive appositeness in human activity recognition: A survey. Expert Systems with Applications 240, 122538. DOI: 10.1016/j.eswa.2023.122538. Online publication date: Apr-2024.
  • (2024) Saliency Based Data Augmentation for Few-Shot Video Action Recognition. MultiMedia Modeling, 367-380. DOI: 10.1007/978-981-96-2064-7_27. Online publication date: 28-Dec-2024.
  • (2023) Hierarchical Motion Excitation Network for Few-Shot Video Recognition. Electronics 12(5), 1090. DOI: 10.3390/electronics12051090. Online publication date: 22-Feb-2023.
