DOI: 10.1145/3581783.3611900

Skeleton MixFormer: Multivariate Topology Representation for Skeleton-based Action Recognition

Published: 27 October 2023

Abstract

Vision Transformers, which perform well in various vision tasks, encounter a bottleneck in skeleton-based action recognition and fall short of advanced GCN-based methods. The root cause is that current skeleton transformers rely on self-attention over the complete channel dimension of the global joints, ignoring the highly discriminative differential correlations within channels, which makes it challenging to dynamically learn a multivariate topology representation. To tackle this, we present Skeleton MixFormer, an innovative spatio-temporal architecture that effectively represents the physical correlations and temporal interactivity of compact skeleton data. The proposed framework consists of two essential components: 1) Spatial MixFormer, which uses channel grouping and mix-attention to compute dynamic multivariate topological relationships. Compared with full-channel self-attention, Spatial MixFormer better highlights the discriminative differences between channel groups and learns a more interpretable joint adjacency. 2) Temporal MixFormer, which comprises a Multiscale Convolution, a Temporal Transformer, and a Sequential Holding Module. These multivariate temporal models ensure rich expression of global differences and discriminate the crucial intervals in a sequence, thereby enabling more effective learning of long- and short-term dependencies in actions. Our Skeleton MixFormer achieves state-of-the-art (SOTA) performance across seven different settings on four standard datasets: NTU-60, NTU-120, NW-UCLA, and UAV-Human. Code is available at https://github.com/ElricXin/Skeleton-MixFormer.
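
The abstract names the key mechanisms but gives no equations, and the authors' actual implementation lives in the linked repository. Purely as an illustration of the channel-grouping idea (the module name, tensor shapes, and pooling choices below are assumptions, not the paper's code), a minimal PyTorch sketch of group-wise joint attention might look like:

```python
# Illustrative sketch only (our reading of the idea, not the authors' code).
# Each channel group gets its own joint-joint attention map ("topology"),
# instead of one attention map computed from the full channel dimension.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedJointAttention(nn.Module):
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        assert channels % groups == 0, "channels must split evenly into groups"
        self.groups = groups
        self.qk = nn.Conv2d(channels, 2 * channels, kernel_size=1)   # joint-wise Q and K
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)     # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, joints), e.g. skeleton features
        n, c, t, v = x.shape
        g = self.groups
        q, k = self.qk(x).chunk(2, dim=1)
        # Pool each channel group over its channels and over time, keeping one
        # descriptor per joint, so every group can learn its own adjacency.
        q = q.view(n, g, c // g, t, v).mean(dim=(2, 3))              # (n, g, v)
        k = k.view(n, g, c // g, t, v).mean(dim=(2, 3))              # (n, g, v)
        # One v-by-v attention map ("topology") per channel group.
        attn = F.softmax(torch.einsum('ngu,ngv->nguv', q, k), dim=-1)
        # Aggregate joint features group by group with that group's topology.
        xg = x.view(n, g, c // g, t, v)
        out = torch.einsum('nguv,ngctv->ngctu', attn, xg).reshape(n, c, t, v)
        return self.proj(out)

x = torch.randn(2, 64, 32, 25)   # batch=2, 64 channels, 32 frames, 25 NTU joints
y = GroupedJointAttention(channels=64, groups=8)(x)
print(y.shape)                   # torch.Size([2, 64, 32, 25])
```

By contrast, a full-channel skeleton transformer would compute a single joint-joint attention map from all channels at once; the grouping above is what lets each channel subset express its own topology. For the real architecture, including the mix-attention and the Temporal MixFormer, see the repository above.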

Supplemental Material

MP4 File: Presentation video




    Information

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. attention
    2. skeleton action recognition
    3. topology representation
    4. transformer
    5. video understanding

    Qualifiers

    • Research-article

    Funding Sources

    • The National Key R&D Program of China
    • The Teaching Reform Project of Shaanxi Higher Continuing Education
    • The National Natural Science Foundation of China

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


    Cited By

    • InfoGCN++: Learning Representation by Predicting the Future for Online Skeleton-Based Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 47, 1 (2025), 514-528. DOI: 10.1109/TPAMI.2024.3466212
    • Adaptive Pitfall: Exploring the Effectiveness of Adaptation in Skeleton-Based Action Recognition. IEEE Transactions on Multimedia 27 (2025), 56-71. DOI: 10.1109/TMM.2024.3521774
    • SG-CLR: Semantic representation-guided contrastive learning for self-supervised skeleton-based action recognition. Pattern Recognition (2025), 111377. DOI: 10.1016/j.patcog.2025.111377
    • MGR-Dark: A Large Multimodal Video Dataset and RGB-IR Benchmark for Gesture Recognition in Darkness. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 2321-2330. DOI: 10.1145/3664647.3681267
    • Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 4909-4918. DOI: 10.1145/3664647.3681015
    • Frequency Guidance Matters: Skeletal Action Recognition by Frequency-Aware Mixed Transformer. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 4660-4669. DOI: 10.1145/3664647.3681009
    • Localized Linear Temporal Dynamics for Self-Supervised Skeleton Action Recognition. IEEE Transactions on Multimedia 26 (2024), 10189-10199. DOI: 10.1109/TMM.2024.3405712
    • Multi-View Time-Series Hypergraph Neural Network for Action Recognition. IEEE Transactions on Image Processing 33 (2024), 3301-3313. DOI: 10.1109/TIP.2024.3391913
    • Action Jitter Killer: Joint Noise Optimization Cascade for Skeleton-Based Action Recognition. IEEE Transactions on Instrumentation and Measurement 73 (2024), 1-14. DOI: 10.1109/TIM.2024.3370958
    • HDBN: A Novel Hybrid Dual-Branch Network for Robust Skeleton-Based Action Recognition. In 2024 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 1-6. DOI: 10.1109/ICMEW63481.2024.10645450
