Authors: Yushan Wang¹; Shuhei Tarashima²,¹ and Norio Tagawa¹
Affiliations: ¹Faculty of Systems Design, Tokyo Metropolitan University, Tokyo, Japan; ²Innovation Center, NTT Communications Corporation, Tokyo, Japan
Keyword(s):
Transformer, 3D Human Pose and Shape Estimation, ViT, Attention Visualization.
Abstract:
Fully-transformer frameworks have gradually replaced traditional convolutional neural networks (CNNs) in recent 3D human pose and shape estimation tasks, largely owing to the attention mechanism, which captures long-range and complex relationships between input tokens and surpasses the representation capabilities of CNNs. Recent attention designs have reduced the computational complexity of transformers in core computer vision tasks such as classification and segmentation, achieving extraordinarily strong results. However, their potential for more complex, higher-level tasks remains unexplored. For the first time, we propose to integrate the group-mix attention mechanism into the 3D human pose and shape estimation task. We combine token-to-token, token-to-group, and group-to-group correlations, enabling a broader capture of human body part relationships and making the approach promising for challenging scenarios such as occlusion and blur. We believe this mix of tokens and groups is well suited to our task, where we need to learn the relevance of various parts of the human body, which often span not individual tokens but larger regions. We quantitatively and qualitatively validate that our method reduces the parameter count by 97.3% (from 620M to 17M) and the FLOPs count by 96.1% (from 242.1G to 9.5G), with a performance gap of less than 3%.
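To make the token/group mixing concrete, below is a minimal PyTorch sketch of a group-mix attention layer in the spirit described by the abstract: group proxies are formed by aggregating spatial neighborhoods of tokens (here via depthwise convolutions) and mixed into the queries, keys, and values so that attention carries token-to-token, token-to-group, and group-to-group correlations at once. The class name, the choice of depthwise-convolution aggregators, and the simple additive mixing are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of group-mix attention (assumed details; not the authors' code).
import torch
import torch.nn as nn


class GroupMixAttention(nn.Module):
    """Self-attention over a mixture of individual tokens and group proxies.

    Group proxies are built with depthwise convolutions over the spatial
    token map, so the resulting attention mixes token-to-token,
    token-to-group, and group-to-group correlations.
    """

    def __init__(self, dim: int, num_heads: int = 4, group_kernels=(3, 5)):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # One depthwise conv per group size; each yields a group-proxy map
        # of the same resolution, aggregating a k x k token neighborhood.
        self.aggregators = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)
            for k in group_kernels
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) with N = h * w spatial tokens.
        b, n, c = x.shape
        q, k, v = self.qkv(x).reshape(b, n, 3, c).unbind(2)

        def mix(t: torch.Tensor) -> torch.Tensor:
            # Mix token-level features with multi-scale group proxies
            # (additive mixing is an assumption made for this sketch).
            t_map = t.transpose(1, 2).reshape(b, c, h, w)
            groups = [agg(t_map).flatten(2).transpose(1, 2)
                      for agg in self.aggregators]
            return t + sum(groups)

        q, k, v = (self._split_heads(mix(t)) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, H, N, N)
        out = attn.softmax(dim=-1) @ v                 # (B, H, N, C/H)
        out = out.transpose(1, 2).reshape(b, n, c)
        return self.proj(out)

    def _split_heads(self, t: torch.Tensor) -> torch.Tensor:
        b, n, c = t.shape
        return t.reshape(b, n, self.num_heads, c // self.num_heads).transpose(1, 2)


# Usage on a hypothetical 16 x 12 patch grid of 64-dim tokens:
layer = GroupMixAttention(dim=64, num_heads=4)
tokens = torch.randn(2, 16 * 12, 64)
out = layer(tokens, h=16, w=12)  # (2, 192, 64)
```

Because the group proxies are added to the token features rather than appended as extra tokens, the attention map stays N x N, which keeps the cost of this sketch close to plain multi-head self-attention while still injecting group-level context.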