Authors: Yushan Wang¹; Shuhei Tarashima²,¹ and Norio Tagawa¹
Affiliations: ¹Faculty of Systems Design, Tokyo Metropolitan University, Tokyo, Japan; ²Innovation Center, NTT Communications Corporation, Tokyo, Japan
Keyword(s):
Transformer, 3D Human Pose and Shape Estimation, ViT, Attention Visualization.
Abstract:
Fully-transformer frameworks have gradually replaced traditional convolutional neural networks (CNNs) in recent 3D human pose and shape estimation tasks, largely owing to the attention mechanism, which captures long-range and complex relationships between input tokens and surpasses the representation capabilities of CNNs. Recent attention designs have reduced the computational complexity of transformers in core computer vision tasks such as classification and segmentation, achieving extraordinarily strong results. However, their potential for more complex, higher-level tasks remains unexplored. For the first time, we propose to integrate the group-mix attention mechanism into the 3D human pose and shape estimation task. We combine token-to-token, token-to-group, and group-to-group correlations, enabling a broader capture of human body part relationships and making the approach promising for challenging scenarios such as occlusion and blur. We believe this mix of tokens and groups is well suited to our task, where we need to learn the relevance of various parts of the human body, which often span not individual tokens but larger regions. We quantitatively and qualitatively validate that our method reduces the parameter count by 97.3% (from 620M to 17M) and the FLOPs count by 96.1% (from 242.1G to 9.5G), with a performance gap of less than 3%.
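To make the token/group mixing concrete, below is a minimal PyTorch sketch of a group-mix attention layer in the spirit described by the abstract: group proxies are formed by aggregating spatial neighborhoods of tokens (here via depthwise convolutions) and mixed into the queries, keys, and values so that attention carries token-to-token, token-to-group, and group-to-group correlations at once. The class name, the choice of depthwise-convolution aggregators, and the simple additive mixing are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of group-mix attention (assumed details; not the authors' code).
import torch
import torch.nn as nn


class GroupMixAttention(nn.Module):
    """Self-attention over a mixture of individual tokens and group proxies.

    Group proxies are built with depthwise convolutions over the spatial
    token map, so the resulting attention mixes token-to-token,
    token-to-group, and group-to-group correlations.
    """

    def __init__(self, dim: int, num_heads: int = 4, group_kernels=(3, 5)):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # One depthwise conv per group size; each yields a group-proxy map
        # of the same resolution, aggregating a k x k token neighborhood.
        self.aggregators = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)
            for k in group_kernels
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) with N = h * w spatial tokens.
        b, n, c = x.shape
        q, k, v = self.qkv(x).reshape(b, n, 3, c).unbind(2)

        def mix(t: torch.Tensor) -> torch.Tensor:
            # Mix token-level features with multi-scale group proxies
            # (additive mixing is an assumption made for this sketch).
            t_map = t.transpose(1, 2).reshape(b, c, h, w)
            groups = [agg(t_map).flatten(2).transpose(1, 2)
                      for agg in self.aggregators]
            return t + sum(groups)

        q, k, v = (self._split_heads(mix(t)) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, H, N, N)
        out = attn.softmax(dim=-1) @ v                 # (B, H, N, C/H)
        out = out.transpose(1, 2).reshape(b, n, c)
        return self.proj(out)

    def _split_heads(self, t: torch.Tensor) -> torch.Tensor:
        b, n, c = t.shape
        return t.reshape(b, n, self.num_heads, c // self.num_heads).transpose(1, 2)


# Usage on a hypothetical 16 x 12 patch grid of 64-dim tokens:
layer = GroupMixAttention(dim=64, num_heads=4)
tokens = torch.randn(2, 16 * 12, 64)
out = layer(tokens, h=16, w=12)  # (2, 192, 64)
```

Because the group proxies are added to the token features rather than appended as extra tokens, the attention map stays N x N, which keeps the cost of this sketch close to plain multi-head self-attention while still injecting group-level context.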