Research Article · DOI: 10.1145/3581783.3612279

Lite-MKD: A Multi-modal Knowledge Distillation Framework for Lightweight Few-shot Action Recognition

Published: 27 October 2023

Abstract

Existing few-shot action recognition methods focus primarily on improving recognition accuracy while neglecting another indicator that matters in practical scenarios: model efficiency. In this paper, we make a first attempt to address this gap and propose a Lightweight Multi-modal Knowledge Distillation framework (Lite-MKD) for few-shot action recognition. In this framework, the teacher model performs multi-modal learning to comprehensively fuse the optical flow, depth, and appearance features of human movements, yielding a more robust representation of actions. The student model learns, under the teacher's guidance, to recognize actions from the single RGB modality at a lower computational cost. To fully explore and integrate multi-modal information, a hierarchical Multi-modal Fusion Module (MFM) is introduced into the teacher model. In addition, a multi-level Distinguish-to-Mimic (D2M) knowledge distillation component is proposed for the student model. D2M improves the student's ability to mimic the teacher's action classification probabilities by enhancing the student's ability to distinguish between the video categories in the support set. Extensive experiments on three action recognition datasets, Kinetics, HMDB51, and UCF101, demonstrate the framework's effectiveness and stable generalization ability. With a much more lightweight network at inference time, we achieve performance comparable to previous state-of-the-art methods. Our source code is available at https://github.com/HuiGuanLab/Lite-MKD.
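
For readers new to the distillation terminology above, the sketch below illustrates the generic probability-level objective a student is trained to mimic: the student's class distribution is pulled toward the teacher's with a temperature-softened KL divergence, as in standard knowledge distillation (Hinton et al., 2015). This is a minimal PyTorch sketch, not the authors' Lite-MKD implementation; the MFM fusion and the multi-level D2M losses are not reproduced here, and all names, tensor shapes, and the temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    # Soften both class distributions with a temperature; the t*t factor
    # keeps gradient magnitudes comparable across temperature settings.
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# Toy usage with made-up scores: 5-way classification for 8 query videos.
teacher_logits = torch.randn(8, 5)  # stand-in for the multi-modal teacher (RGB + flow + depth)
student_logits = torch.randn(8, 5)  # stand-in for the lightweight RGB-only student
print(distillation_loss(student_logits, teacher_logits).item())

Per the abstract, D2M strengthens exactly this kind of mimicry: it first sharpens how well the student separates the video categories in the support set, so that the probabilities being matched are more discriminative.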



    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. efficient deep learning
    2. few-shot action recognition
    3. knowledge distillation
    4. multi-modal learning

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall acceptance rate: 2,145 of 8,556 submissions (25%)
