
Swin-Fusion: Swin-Transformer with Feature Fusion for Human Action Recognition


Abstract

Human action recognition based on still images is one of the most challenging computer vision tasks. Over the past decade, convolutional neural networks (CNNs) have developed rapidly and achieved good performance on still image-based human action recognition. However, because CNNs lack long-range perception, it is difficult for them to capture the global structure of human behavior and the overall relationship between the behavior and its environment. Recently, transformer-based models have made rapid progress in computer vision, even reaching state-of-the-art performance on several vision tasks. We explore the transformer's capability for human action recognition based on still images and add a simple but effective feature fusion module to the Swin-Transformer model. More specifically, we propose a new transformer-based model for behavioral feature extraction that uses a pre-trained Swin-Transformer as the backbone network. The Swin-Transformer's distinctive hierarchical structure, combined with the feature fusion module, is used to extract and fuse multi-scale behavioral information. Extensive experiments were conducted on five still image-based human action recognition datasets: Li's action dataset, the Stanford-40 dataset, the PPMI-24 dataset, the AUC-V1 dataset, and the AUC-V2 dataset. The results indicate that our proposed Swin-Fusion model achieves better behavior recognition than previously proposed CNN-based models by sharing and reusing feature maps of different scales at multiple stages, without modifying the original backbone training method and while increasing training resources by only 1.6%. The code and models will be available at https://github.com/cts4444/Swin-Fusion.
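
To make the fusion idea concrete, below is a minimal, illustrative sketch (not the authors' released code) of how multi-scale feature maps from a hierarchical backbone such as Swin-Transformer could be projected, resized, and concatenated before classification. The module name MultiScaleFusionHead, the Swin-B style channel widths (128/256/512/1024 at strides 4/8/16/32), the 1x1-convolution projection, and the 40-class output (as in Stanford-40) are assumptions made for illustration; the actual Swin-Fusion module may differ in its fusion details.

```python
# Illustrative sketch only: fusing multi-scale feature maps from a
# hierarchical backbone (e.g. Swin-Transformer). Channel widths assume a
# Swin-B style backbone; adjust for other variants.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusionHead(nn.Module):
    """Project each stage's feature map to a common width, resize to the
    coarsest resolution, concatenate, and classify (hypothetical head)."""
    def __init__(self, in_channels=(128, 256, 512, 1024), embed_dim=256, num_classes=40):
        super().__init__()
        # One 1x1 convolution per backbone stage to align channel widths.
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, embed_dim, kernel_size=1) for c in in_channels]
        )
        self.classifier = nn.Linear(embed_dim * len(in_channels), num_classes)

    def forward(self, features):
        # features: list of (B, C_i, H_i, W_i) maps from shallow to deep stages.
        target_size = features[-1].shape[-2:]           # coarsest spatial size
        fused = [
            F.adaptive_avg_pool2d(p(f), target_size)    # project and resize
            for p, f in zip(self.proj, features)
        ]
        x = torch.cat(fused, dim=1)                     # share maps across scales
        x = x.mean(dim=(2, 3))                          # global average pooling
        return self.classifier(x)

if __name__ == "__main__":
    # Dummy stage outputs standing in for Swin-B features at 224x224 input.
    feats = [torch.randn(2, c, s, s) for c, s in
             zip((128, 256, 512, 1024), (56, 28, 14, 7))]
    head = MultiScaleFusionHead(num_classes=40)         # e.g. Stanford-40 classes
    print(head(feats).shape)                            # -> torch.Size([2, 40])
```

In this sketch, resizing every stage to the coarsest resolution keeps the fused tensor small, which is consistent with the abstract's claim of adding only a lightweight module (about 1.6% more training resources) on top of the unmodified backbone.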




Funding

This study was funded by the National Natural Science Foundation of China [61603091, Multi-Dimensions Based Physical Activity Assessment for the Human Daily Life] and the 2021 Blue Project for University of Jiangsu Province.

Author information


Contributions

Tiansheng Chen: Conceptualization, Methodology, Software, Data curation, Writing - original draft, Visualization, Investigation, Validation. Lingfei Mo: Conceptualization, Writing - review & editing, Supervision.

Corresponding author

Correspondence to Lingfei Mo.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chen, T., Mo, L. Swin-Fusion: Swin-Transformer with Feature Fusion for Human Action Recognition. Neural Process Lett 55, 11109–11130 (2023). https://doi.org/10.1007/s11063-023-11367-1

