DOI: 10.1145/3394171.3416269
Research article

Few-Shot Ensemble Learning for Video Classification with SlowFast Memory Networks

Published: 12 October 2020

Abstract

In the era of big data, few-shot learning has recently received much attention in multimedia analysis and computer vision, owing to its ability to learn from scarce labeled data. It remains largely underdeveloped in the video domain, however, where the task is even harder due to the large spatio-temporal variability of video data. In this paper, we address few-shot video classification by learning an ensemble of SlowFast networks augmented with memory units. Specifically, we introduce a family of few-shot learners based on SlowFast networks, which extract informative features at multiple rates, and we incorporate a memory unit into each network so that crucial information can be encoded and retrieved instantly. Furthermore, we propose a choice controller network that leverages the diversity of the few-shot learners by learning to adaptively assign a confidence score to each SlowFast memory network, yielding a strong classifier with enhanced predictions. Experimental results on two widely adopted video datasets demonstrate the effectiveness of the proposed method and its superior performance over state-of-the-art approaches.
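The abstract describes two mechanisms: a key-value memory that retrieves stored information for a query feature, and a choice controller that fuses the learners' predictions via per-learner confidence scores. A minimal sketch of both ideas in plain Python follows; all names and numbers are hypothetical illustrations, not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def memory_read(memory, query):
    """Retrieve the value whose key is most similar to the query feature."""
    best_key = max(memory, key=lambda k: cosine(k, query))
    return memory[best_key]

def ensemble_predict(learners, confidences, clip):
    """Fuse per-learner class scores, weighted by controller confidences,
    and return the index of the winning class."""
    scores = [learner(clip) for learner in learners]
    n = len(scores[0])
    fused = [sum(w * s[c] for w, s in zip(confidences, scores))
             for c in range(n)]
    return max(range(n), key=fused.__getitem__)

# Toy usage: a two-slot memory, then two fixed "learners" over three
# classes with a controller that trusts the first (slow) pathway more.
memory = {(1.0, 0.0): "run", (0.0, 1.0): "jump"}
print(memory_read(memory, (0.9, 0.2)))                        # -> run

slow = lambda clip: [0.2, 0.7, 0.1]
fast = lambda clip: [0.6, 0.3, 0.1]
print(ensemble_predict([slow, fast], [0.8, 0.2], clip=None))  # -> 1
```

In the paper, the learners are SlowFast networks with trainable memories and the confidences come from the choice controller network; here both are stubbed with constants purely to show how the confidence-weighted fusion operates.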

Supplementary Material

MP4 File (3394171.3416269.mp4)
In the era of big data, few-shot learning has recently received much attention in multimedia analysis and computer vision due to its appealing ability of learning from scarce labeled data. In this work, we address few-shot video classification by learning an ensemble of SlowFast networks augmented with memory units.




Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN:9781450379885
DOI:10.1145/3394171
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. ensemble learning
  2. few-shot learning
  3. memory network
  4. video classification

Qualifiers

  • Research-article

Conference

MM '20

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Cited By

  • (2025) Consistency Prototype Module and Motion Compensation for few-shot action recognition (CLIP-CPM C). Neurocomputing, Vol. 611, 128649. DOI: 10.1016/j.neucom.2024.128649. Online publication date: Jan 2025.
  • (2024) Task-Specific Alignment and Multiple-level Transformer for few-shot action recognition. Neurocomputing, Vol. 598, 128044. DOI: 10.1016/j.neucom.2024.128044. Online publication date: Sep 2024.
  • (2024) Transfer learning and its extensive appositeness in human activity recognition: A survey. Expert Systems with Applications, Vol. 240, 122538. DOI: 10.1016/j.eswa.2023.122538. Online publication date: Apr 2024.
  • (2023) Disentangled counterfactual learning for physical audiovisual commonsense reasoning. Proceedings of the 37th International Conference on Neural Information Processing Systems, 12476-12488. DOI: 10.5555/3666122.3666668. Online publication date: 10 Dec 2023.
  • (2023) On the Use of Deep Learning for Video Classification. Applied Sciences, Vol. 13, 3, 2007. DOI: 10.3390/app13032007. Online publication date: 3 Feb 2023.
  • (2023) Learning Dual-Routing Capsule Graph Neural Network for Few-Shot Video Classification. IEEE Transactions on Multimedia, Vol. 25, 3204-3216. DOI: 10.1109/TMM.2022.3156938. Online publication date: 1 Jan 2023.
  • (2023) Study on the Vulnerability of Video Retargeting Method for Generated Videos by Deep Learning Model. 2023 Fourteenth International Conference on Ubiquitous and Future Networks (ICUFN), 834-836. DOI: 10.1109/ICUFN57995.2023.10201216. Online publication date: 4 Jul 2023.
  • (2023) Unsupervised Self-Driving Attention Prediction via Uncertainty Mining and Knowledge Embedding. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 8524-8534. DOI: 10.1109/ICCV51070.2023.00786. Online publication date: 1 Oct 2023.
  • (2022) Attention-Aware Multiple Granularities Network for Player Re-Identification. Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports, 137-144. DOI: 10.1145/3552437.3555695. Online publication date: 14 Oct 2022.
  • (2022) Diagnosing Ensemble Few-Shot Classifiers. IEEE Transactions on Visualization and Computer Graphics, Vol. 28, 9, 3292-3306. DOI: 10.1109/TVCG.2022.3182488. Online publication date: 1 Sep 2022.
