DOI: 10.1145/3581783.3611900

Skeleton MixFormer: Multivariate Topology Representation for Skeleton-based Action Recognition

Published: 27 October 2023

Abstract

Vision Transformers, which perform well in various vision tasks, encounter a bottleneck in skeleton-based action recognition and fall short of advanced GCN-based methods. The root cause is that current skeleton transformers rely on self-attention over the complete channel dimension of the global joints, ignoring the highly discriminative differential correlations within channels, which makes it challenging to dynamically learn a multivariate topology representation. To tackle this, we present Skeleton MixFormer, an innovative spatio-temporal architecture that effectively represents the physical correlations and temporal interactivity of compact skeleton data. The proposed framework consists of two essential components: 1) Spatial MixFormer, which uses channel grouping and mix-attention to compute dynamic multivariate topological relationships. Compared with full-channel self-attention, Spatial MixFormer better highlights the discriminative differences between channel groups and learns a more interpretable joint adjacency. 2) Temporal MixFormer, which comprises a Multiscale Convolution, a Temporal Transformer, and a Sequential Holding Module. These multivariate temporal models ensure rich expression of global differences and discriminate the crucial intervals in a sequence, thereby enabling more effective learning of long- and short-term dependencies in actions. Our Skeleton MixFormer achieves state-of-the-art (SOTA) performance across seven different settings on four standard datasets: NTU-60, NTU-120, NW-UCLA, and UAV-Human. Code is available at https://github.com/ElricXin/Skeleton-MixFormer.
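
The abstract names the key mechanisms but gives no equations, and the authors' actual implementation lives in the linked repository. Purely as an illustration of the channel-grouping idea (the module name, tensor shapes, and pooling choices below are assumptions, not the paper's code), a minimal PyTorch sketch of group-wise joint attention might look like:

```python
# Illustrative sketch only (our reading of the idea, not the authors' code).
# Each channel group gets its own joint-joint attention map ("topology"),
# instead of one attention map computed from the full channel dimension.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedJointAttention(nn.Module):
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        assert channels % groups == 0, "channels must split evenly into groups"
        self.groups = groups
        self.qk = nn.Conv2d(channels, 2 * channels, kernel_size=1)   # joint-wise Q and K
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)     # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, joints), e.g. skeleton features
        n, c, t, v = x.shape
        g = self.groups
        q, k = self.qk(x).chunk(2, dim=1)
        # Pool each channel group over its channels and over time, keeping one
        # descriptor per joint, so every group can learn its own adjacency.
        q = q.view(n, g, c // g, t, v).mean(dim=(2, 3))              # (n, g, v)
        k = k.view(n, g, c // g, t, v).mean(dim=(2, 3))              # (n, g, v)
        # One v-by-v attention map ("topology") per channel group.
        attn = F.softmax(torch.einsum('ngu,ngv->nguv', q, k), dim=-1)
        # Aggregate joint features group by group with that group's topology.
        xg = x.view(n, g, c // g, t, v)
        out = torch.einsum('nguv,ngctv->ngctu', attn, xg).reshape(n, c, t, v)
        return self.proj(out)

x = torch.randn(2, 64, 32, 25)   # batch=2, 64 channels, 32 frames, 25 NTU joints
y = GroupedJointAttention(channels=64, groups=8)(x)
print(y.shape)                   # torch.Size([2, 64, 32, 25])
```

By contrast, a full-channel skeleton transformer would compute a single joint-joint attention map from all channels at once; the grouping above is what lets each channel subset express its own topology. For the real architecture, including the mix-attention and the Temporal MixFormer, see the repository above.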

Supplemental Material

MP4 File: Presentation video




    Information

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. attention
    2. skeleton action recognition
    3. topology representation
    4. transformer
    5. video understanding

    Qualifiers

    • Research-article

    Funding Sources

    • The National Key R&D Program of China
    • The Teaching Reform Project of Shaanxi Higher Continuing Education
    • The National Natural Science Foundation of China

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


    Cited By

    • InfoGCN++: Learning Representation by Predicting the Future for Online Skeleton-Based Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 47, 1 (2025), 514-528. DOI: 10.1109/TPAMI.2024.3466212
    • Adaptive Pitfall: Exploring the Effectiveness of Adaptation in Skeleton-Based Action Recognition. IEEE Transactions on Multimedia 27 (2025), 56-71. DOI: 10.1109/TMM.2024.3521774
    • SG-CLR: Semantic representation-guided contrastive learning for self-supervised skeleton-based action recognition. Pattern Recognition (2025), 111377. DOI: 10.1016/j.patcog.2025.111377
    • MGR-Dark: A Large Multimodal Video Dataset and RGB-IR Benchmark for Gesture Recognition in Darkness. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 2321-2330. DOI: 10.1145/3664647.3681267
    • Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 4909-4918. DOI: 10.1145/3664647.3681015
    • Frequency Guidance Matters: Skeletal Action Recognition by Frequency-Aware Mixed Transformer. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 4660-4669. DOI: 10.1145/3664647.3681009
    • Localized Linear Temporal Dynamics for Self-Supervised Skeleton Action Recognition. IEEE Transactions on Multimedia 26 (2024), 10189-10199. DOI: 10.1109/TMM.2024.3405712
    • Multi-View Time-Series Hypergraph Neural Network for Action Recognition. IEEE Transactions on Image Processing 33 (2024), 3301-3313. DOI: 10.1109/TIP.2024.3391913
    • Action Jitter Killer: Joint Noise Optimization Cascade for Skeleton-Based Action Recognition. IEEE Transactions on Instrumentation and Measurement 73 (2024), 1-14. DOI: 10.1109/TIM.2024.3370958
    • HDBN: A Novel Hybrid Dual-Branch Network for Robust Skeleton-Based Action Recognition. In 2024 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 1-6. DOI: 10.1109/ICMEW63481.2024.10645450
