DOI: 10.1145/3394171.3416269
Research article

Few-Shot Ensemble Learning for Video Classification with SlowFast Memory Networks

Published: 12 October 2020

Abstract

In the era of big data, few-shot learning has recently received much attention in multimedia analysis and computer vision, owing to its ability to learn from scarce labeled data. It remains largely underdeveloped in the video domain, however, where the task is even harder due to the large spatio-temporal variability of video data. In this paper, we address few-shot video classification by learning an ensemble of SlowFast networks augmented with memory units. Specifically, we introduce a family of few-shot learners based on SlowFast networks, which extract informative features at multiple rates, and we incorporate a memory unit into each network so that crucial information can be encoded and retrieved instantly. Furthermore, we propose a choice controller network that leverages the diversity of the few-shot learners by learning to adaptively assign a confidence score to each SlowFast memory network, yielding a strong classifier with enhanced predictions. Experimental results on two widely adopted video datasets demonstrate the effectiveness of the proposed method and its superior performance over state-of-the-art approaches.
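The abstract describes two mechanisms: a key-value memory that retrieves stored information for a query feature, and a choice controller that fuses the learners' predictions via per-learner confidence scores. A minimal sketch of both ideas in plain Python follows; all names and numbers are hypothetical illustrations, not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def memory_read(memory, query):
    """Retrieve the value whose key is most similar to the query feature."""
    best_key = max(memory, key=lambda k: cosine(k, query))
    return memory[best_key]

def ensemble_predict(learners, confidences, clip):
    """Fuse per-learner class scores, weighted by controller confidences,
    and return the index of the winning class."""
    scores = [learner(clip) for learner in learners]
    n = len(scores[0])
    fused = [sum(w * s[c] for w, s in zip(confidences, scores))
             for c in range(n)]
    return max(range(n), key=fused.__getitem__)

# Toy usage: a two-slot memory, then two fixed "learners" over three
# classes with a controller that trusts the first (slow) pathway more.
memory = {(1.0, 0.0): "run", (0.0, 1.0): "jump"}
print(memory_read(memory, (0.9, 0.2)))                        # -> run

slow = lambda clip: [0.2, 0.7, 0.1]
fast = lambda clip: [0.6, 0.3, 0.1]
print(ensemble_predict([slow, fast], [0.8, 0.2], clip=None))  # -> 1
```

In the paper, the learners are SlowFast networks with trainable memories and the confidences come from the choice controller network; here both are stubbed with constants purely to show how the confidence-weighted fusion operates.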

Supplementary Material

MP4 File (3394171.3416269.mp4)
In the era of big data, few-shot learning has recently received much attention in multimedia analysis and computer vision due to its appealing ability of learning from scarce labeled data. In this work, we address few-shot video classification by learning an ensemble of SlowFast networks augmented with memory units.




Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN:9781450379885
DOI:10.1145/3394171
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. ensemble learning
  2. few-shot learning
  3. memory network
  4. video classification

Qualifiers

  • Research-article

Conference

MM '20

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Cited By

  • (2025) Consistency Prototype Module and Motion Compensation for few-shot action recognition (CLIP-CPM C). Neurocomputing, Vol. 611, 128649. DOI: 10.1016/j.neucom.2024.128649. Online publication date: Jan 2025.
  • (2024) Task-Specific Alignment and Multiple-level Transformer for few-shot action recognition. Neurocomputing, Vol. 598, 128044. DOI: 10.1016/j.neucom.2024.128044. Online publication date: Sep 2024.
  • (2024) Transfer learning and its extensive appositeness in human activity recognition: A survey. Expert Systems with Applications, Vol. 240, 122538. DOI: 10.1016/j.eswa.2023.122538. Online publication date: Apr 2024.
  • (2023) Disentangled counterfactual learning for physical audiovisual commonsense reasoning. Proceedings of the 37th International Conference on Neural Information Processing Systems, 12476-12488. DOI: 10.5555/3666122.3666668. Online publication date: 10 Dec 2023.
  • (2023) On the Use of Deep Learning for Video Classification. Applied Sciences, Vol. 13, 3, 2007. DOI: 10.3390/app13032007. Online publication date: 3 Feb 2023.
  • (2023) Learning Dual-Routing Capsule Graph Neural Network for Few-Shot Video Classification. IEEE Transactions on Multimedia, Vol. 25, 3204-3216. DOI: 10.1109/TMM.2022.3156938. Online publication date: 1 Jan 2023.
  • (2023) Study on the Vulnerability of Video Retargeting Method for Generated Videos by Deep Learning Model. 2023 Fourteenth International Conference on Ubiquitous and Future Networks (ICUFN), 834-836. DOI: 10.1109/ICUFN57995.2023.10201216. Online publication date: 4 Jul 2023.
  • (2023) Unsupervised Self-Driving Attention Prediction via Uncertainty Mining and Knowledge Embedding. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 8524-8534. DOI: 10.1109/ICCV51070.2023.00786. Online publication date: 1 Oct 2023.
  • (2022) Attention-Aware Multiple Granularities Network for Player Re-Identification. Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports, 137-144. DOI: 10.1145/3552437.3555695. Online publication date: 14 Oct 2022.
  • (2022) Diagnosing Ensemble Few-Shot Classifiers. IEEE Transactions on Visualization and Computer Graphics, Vol. 28, 9, 3292-3306. DOI: 10.1109/TVCG.2022.3182488. Online publication date: 1 Sep 2022.
