Research Article · DOI: 10.1145/3581783.3612279

Lite-MKD: A Multi-modal Knowledge Distillation Framework for Lightweight Few-shot Action Recognition

Published: 27 October 2023

Abstract

Existing few-shot action recognition methods focus primarily on improving recognition accuracy while neglecting another indicator that matters in practical scenarios: model efficiency. In this paper, we make a first attempt to address this gap and propose a Lightweight Multi-modal Knowledge Distillation framework (Lite-MKD) for few-shot action recognition. In this framework, the teacher model performs multi-modal learning to comprehensively fuse the optical flow, depth, and appearance features of human movements, yielding a more robust representation of actions. The student model learns, under the teacher's guidance, to recognize actions from the single RGB modality at a lower computational cost. To fully explore and integrate multi-modal information, a hierarchical Multi-modal Fusion Module (MFM) is introduced into the teacher model. In addition, a multi-level Distinguish-to-Mimic (D2M) knowledge distillation component is proposed for the student model. D2M improves the student's ability to mimic the teacher's action classification probabilities by enhancing the student's ability to distinguish between the video categories in the support set. Extensive experiments on three action recognition datasets, Kinetics, HMDB51, and UCF101, demonstrate the framework's effectiveness and stable generalization ability. With a much more lightweight network at inference time, we achieve performance comparable to previous state-of-the-art methods. Our source code is available at https://github.com/HuiGuanLab/Lite-MKD.
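
For readers new to the distillation terminology above, the sketch below illustrates the generic probability-level objective a student is trained to mimic: the student's class distribution is pulled toward the teacher's with a temperature-softened KL divergence, as in standard knowledge distillation (Hinton et al., 2015). This is a minimal PyTorch sketch, not the authors' Lite-MKD implementation; the MFM fusion and the multi-level D2M losses are not reproduced here, and all names, tensor shapes, and the temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    # Soften both class distributions with a temperature; the t*t factor
    # keeps gradient magnitudes comparable across temperature settings.
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# Toy usage with made-up scores: 5-way classification for 8 query videos.
teacher_logits = torch.randn(8, 5)  # stand-in for the multi-modal teacher (RGB + flow + depth)
student_logits = torch.randn(8, 5)  # stand-in for the lightweight RGB-only student
print(distillation_loss(student_logits, teacher_logits).item())

Per the abstract, D2M strengthens exactly this kind of mimicry: it first sharpens how well the student separates the video categories in the support set, so that the probabilities being matched are more discriminative.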



    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. efficient deep learning
    2. few-shot action recognition
    3. knowledge distillation
    4. multi-modal learning

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall acceptance rate: 2,145 of 8,556 submissions (25%)
