
Authors: Yushan Wang 1; Shuhei Tarashima 2,1 and Norio Tagawa 1

Affiliations: 1 Faculty of Systems Design, Tokyo Metropolitan University, Tokyo, Japan; 2 Innovation Center, NTT Communications Corporation, Tokyo, Japan

Keyword(s): Transformer, 3D Human Pose and Shape Estimation, ViT, Attention Visualization.

Abstract: Fully-transformer frameworks have gradually replaced traditional convolutional neural networks (CNNs) in recent 3D human pose and shape estimation tasks, largely because their attention mechanism can capture long-range and complex relationships between input tokens, surpassing the representational capabilities of CNNs. Recent attention designs have reduced the computational complexity of transformers in core computer vision tasks such as classification and segmentation, achieving remarkably strong results. However, their potential for more complex, higher-level tasks remains unexplored. For the first time, we propose to integrate the group-mix attention mechanism into the 3D human pose and shape estimation task. We combine token-to-token, token-to-group, and group-to-group correlations, enabling a broader capture of human body part relationships and making the method promising for challenging scenarios such as occlusion and blur. We believe this mix of tokens and groups is well suited to our task, where we need to learn the relevance of various parts of the human body, which are often not individual tokens but larger in scope. We quantitatively and qualitatively validate that our method reduces the parameter count by 97.3% (from 620M to 17M) and the FLOP count by 96.1% (from 242.1G to 9.5G), with a performance gap of less than 3%.
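The abstract describes mixing token-level and group-level correlations in one attention operation. The following is a minimal, hypothetical sketch of that idea in NumPy: group "proxies" are formed by average-pooling windows of tokens, and each token then attends jointly over the original tokens and the proxies, giving token-to-token and token-to-group terms in a single softmax. The function name, the pooling-based grouping, and the absence of learned Q/K/V projections are all simplifying assumptions for illustration, not the paper's actual implementation (which, per the abstract, also models group-to-group correlations).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def group_mix_attention(x, group_size=4):
    """Illustrative sketch (not the paper's method).

    x: (n_tokens, dim) token embeddings.
    Group proxies are averages over non-overlapping windows of
    `group_size` tokens; keys/values are the tokens plus the proxies,
    so attention scores mix token-to-token and token-to-group terms.
    """
    n, d = x.shape
    n_groups = n // group_size
    # Average-pool windows of tokens into group proxies: (n_groups, d).
    proxies = x[: n_groups * group_size].reshape(n_groups, group_size, d).mean(axis=1)
    # Mixed key/value set: individual tokens followed by group proxies.
    kv = np.concatenate([x, proxies], axis=0)           # (n + n_groups, d)
    scores = (x @ kv.T) / np.sqrt(d)                    # (n, n + n_groups)
    attn = softmax(scores, axis=-1)
    return attn @ kv                                    # (n, d)
```

In a real transformer block the grouping would use learned aggregators (e.g. depthwise convolutions) and the attention would use learned query/key/value projections; the sketch only shows how enlarging the key/value set with group summaries lets one attention map score relations between a token and a body-part-sized region rather than single tokens only.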

CC BY-NC-ND 4.0


Paper citation in several formats:
Wang, Y., Tarashima, S. and Tagawa, N. (2025). Efficient 3D Human Pose and Shape Estimation Using Group-Mix Attention in Transformer Models. In Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP; ISBN 978-989-758-728-3; ISSN 2184-4321, SciTePress, pages 735-742. DOI: 10.5220/0013306400003912

@conference{visapp25,
author={Yushan Wang and Shuhei Tarashima and Norio Tagawa},
title={Efficient 3D Human Pose and Shape Estimation Using Group-Mix Attention in Transformer Models},
booktitle={Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP},
year={2025},
pages={735-742},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013306400003912},
isbn={978-989-758-728-3},
issn={2184-4321},
}

TY - CONF

JO - Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP
TI - Efficient 3D Human Pose and Shape Estimation Using Group-Mix Attention in Transformer Models
SN - 978-989-758-728-3
IS - 2184-4321
AU - Wang, Y.
AU - Tarashima, S.
AU - Tagawa, N.
PY - 2025
SP - 735
EP - 742
DO - 10.5220/0013306400003912
PB - SciTePress
ER -