Abstract
Spatio-temporal convolution models are widely used for action recognition across many domains. Such a model typically takes short video clips as input, runs inference on multiple clips per video, and derives a video-level prediction through an aggregation function. However, the model produces high-confidence predictions regardless of whether an input clip contains enough spatio-temporal information to indicate its class. These overconfident errors on uninformative clips then degrade the accuracy of the aggregated video-level result. The common mitigation is to increase the number of sampled clips, but this treats the symptom rather than the root cause. To address the issue, we propose a fine-tuning framework based on a fuzzy error loss that further refines a well-trained spatio-temporal convolution model relying on dense sampling. By driving the model to output low-confidence predictions for clips with insufficient spatio-temporal information, our framework improves video-level action recognition accuracy. We conducted extensive experiments on two action recognition datasets, UCF101 and Kinetics-Sounds, to evaluate the effectiveness of the proposed framework. The results show a significant improvement in video-level recognition accuracy on both datasets.
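The following PyTorch sketch illustrates the two ideas the abstract describes. The first function shows multi-clip, video-level inference; the abstract only says predictions are combined by "an aggregation function", so averaging per-clip softmax scores is an assumption, as is the model interface. The second function is a hypothetical confidence-suppressing fine-tuning objective in the spirit of the abstract, not the paper's actual fuzzy error loss (whose formulation is not given here); the `informative_mask` input and `beta` weight are illustrative names we introduce.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def video_level_prediction(model: torch.nn.Module, clips: torch.Tensor) -> int:
    """clips: (num_clips, C, T, H, W) sampled from a single video.
    Assumes `model` maps clips to logits of shape (num_clips, num_classes);
    score averaging stands in for the paper's unspecified aggregation."""
    model.eval()
    logits = model(clips)                    # (num_clips, num_classes)
    clip_scores = F.softmax(logits, dim=1)   # per-clip class confidences
    video_scores = clip_scores.mean(dim=0)   # aggregate clips -> video level
    return int(video_scores.argmax())        # predicted class index

def confidence_suppressing_loss(logits: torch.Tensor,
                                labels: torch.Tensor,
                                informative_mask: torch.Tensor,
                                beta: float = 0.5) -> torch.Tensor:
    """Hypothetical stand-in for a fuzzy-error-style objective: fit the label
    on informative clips, and reward high prediction entropy (low confidence)
    on clips flagged as lacking spatio-temporal evidence."""
    ce = F.cross_entropy(logits, labels, reduction="none")       # per-clip CE
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)  # per-clip entropy
    # informative clips: minimize CE; uninformative clips: maximize entropy
    loss = torch.where(informative_mask, ce, -beta * entropy)
    return loss.mean()
```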
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China under Grant 62072048, and in part by Industry-University-Research Innovation Fund of Universities in China under Grant 2021ITA07005.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Li, J., Yang, M., Liu, Y., Xi, G., Zhang, L., Tian, Y. (2024). A Fuzzy Error Based Fine-Tune Method for Spatio-Temporal Recognition Model. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14425. Springer, Singapore. https://doi.org/10.1007/978-981-99-8429-9_8
DOI: https://doi.org/10.1007/978-981-99-8429-9_8
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8428-2
Online ISBN: 978-981-99-8429-9
eBook Packages: Computer Science, Computer Science (R0)