DOI: 10.1145/3343031.3351015

Embodied One-Shot Video Recognition: Learning from Actions of a Virtual Embodied Agent

Published: 15 October 2019

Abstract

One-shot learning aims to recognize novel target classes from only a few examples by transferring knowledge from source classes, under the general assumption that the source and target classes are semantically related but not identical. Building on this assumption, recent work has focused on image-based one-shot learning, while little work has addressed video-based one-shot learning. One challenge is that the disjoint-class assumption is difficult to maintain for videos, since clips of target classes may appear within the videos of source classes. To address this issue, we introduce a novel setting, termed embodied-agent-based one-shot learning, which leverages synthetic videos produced in a virtual environment to understand realistic videos of target classes. Within this setting, we further propose two learning tasks: embodied one-shot video domain adaptation and embodied one-shot video transfer recognition. These tasks serve as a testbed for evaluating video-related one-shot learning methods. In addition, we propose a general video segment augmentation method that significantly facilitates a variety of one-shot learning tasks. Experimental results validate the soundness of our setting and learning tasks, and show that our augmentation approach is effective for video recognition in the small-sample-size regime.
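The abstract mentions a general video segment augmentation method but does not spell out the procedure here. Purely as a hedged illustration of what segment-level augmentation for small-sample video recognition can look like, the NumPy sketch below recombines temporal segments drawn from same-class clips; the recombination strategy, function names, and parameters are assumptions for illustration, not the authors' published algorithm.

```python
# Hypothetical sketch of segment-level video augmentation for small-sample
# training. NOT the paper's published method; the recombination strategy,
# function names, and parameters are illustrative assumptions only.
import numpy as np

def split_segments(video, num_segments=4):
    """Split a video array of shape (T, H, W, C) into temporal segments."""
    return np.array_split(video, num_segments, axis=0)

def recombine_segments(videos, num_segments=4, rng=None):
    """Build a new clip by drawing each temporal segment (in order) from a
    randomly chosen video of the same action class, so the augmented clip
    stays temporally coherent while mixing appearance across examples."""
    rng = rng or np.random.default_rng()
    pieces = []
    for i in range(num_segments):
        donor = videos[rng.integers(len(videos))]  # pick a same-class clip
        pieces.append(split_segments(donor, num_segments)[i])
    return np.concatenate(pieces, axis=0)

# Usage: two 16-frame clips of one action class yield extra training clips.
rng = np.random.default_rng(0)
clips = [rng.random((16, 112, 112, 3), dtype=np.float32) for _ in range(2)]
augmented = recombine_segments(clips, rng=rng)
print(augmented.shape)  # (16, 112, 112, 3)
```

Under these assumptions, each synthetic clip preserves the rough temporal structure of the action (segment i always comes from position i of some donor), which is what makes segment-level mixing plausible for video, where naive frame shuffling would destroy motion cues.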




Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 October 2019


Author Tags

  1. embodied agents
  2. one-shot learning
  3. video action recognition

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • National Key Research and Development Program of China

Conference

MM '19

Acceptance Rates

MM '19 Paper Acceptance Rate: 252 of 936 submissions, 27%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (last 12 months): 23
  • Downloads (last 6 weeks): 3
Reflects downloads up to 27 Feb 2025


Cited By

  • (2024) Unified View Empirical Study for Large Pretrained Model on Cross-Domain Few-Shot Learning. ACM Transactions on Multimedia Computing, Communications, and Applications 20(9), 1-18. DOI: 10.1145/3673231. Online publication date: 19-Jun-2024.
  • (2024) Spatiotemporal Orthogonal Projection Capsule Network for Incremental Few-Shot Action Recognition. IEEE Transactions on Multimedia 26, 9825-9838. DOI: 10.1109/TMM.2024.3399453. Online publication date: 13-May-2024.
  • (2024) Cross-Modal Contrastive Learning Network for Few-Shot Action Recognition. IEEE Transactions on Image Processing 33, 1257-1271. DOI: 10.1109/TIP.2024.3354104. Online publication date: 2024.
  • (2024) SFF-DA: Spatiotemporal Feature Fusion for Nonintrusively Detecting Anxiety. IEEE Transactions on Instrumentation and Measurement 73, 1-13. DOI: 10.1109/TIM.2023.3341132. Online publication date: 2024.
  • (2024) Few-Shot Action Recognition via Multi-View Representation Learning. IEEE Transactions on Circuits and Systems for Video Technology 34(9), 8522-8535. DOI: 10.1109/TCSVT.2024.3384875. Online publication date: Sep-2024.
  • (2024) Advances in Few-Shot Action Recognition: A Comprehensive Review. 2024 7th International Conference on Artificial Intelligence and Big Data (ICAIBD), 390-398. DOI: 10.1109/ICAIBD62003.2024.10604585. Online publication date: 24-May-2024.
  • (2024) Harnessing Meta-Learning for Improving Full-Frame Video Stabilization. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12605-12614. DOI: 10.1109/CVPR52733.2024.01198. Online publication date: 16-Jun-2024.
  • (2024) Transfer learning and its extensive appositeness in human activity recognition: A survey. Expert Systems with Applications 240, 122538. DOI: 10.1016/j.eswa.2023.122538. Online publication date: Apr-2024.
  • (2024) Saliency Based Data Augmentation for Few-Shot Video Action Recognition. MultiMedia Modeling, 367-380. DOI: 10.1007/978-981-96-2064-7_27. Online publication date: 28-Dec-2024.
  • (2023) Hierarchical Motion Excitation Network for Few-Shot Video Recognition. Electronics 12(5), 1090. DOI: 10.3390/electronics12051090. Online publication date: 22-Feb-2023.
