SSpose: Self-Supervised Spatial-Aware Model for Human Pose Estimation


Impact Statement:
In this article, a spatial-aware model, SSpose, is proposed that employs the multihead self-attention mechanism of the Transformer to improve the modeling of global spatial relationships for specific keypoints. This explicit learning approach enhances the model's interpretability. A self-supervised training framework is introduced that uses masked convolution and a hierarchical masking strategy to address the problem of visible-information leakage in convolutional reconstruction. By integrating the masked autoencoder (MAE) into SSpose, the framework achieves efficient self-supervised learning and enhances the model's generalization capability. Together, these innovations effectively improve human keypoint detection performance on public datasets.
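To make the leakage problem concrete, the sketch below shows one way masked convolution and hierarchical masking can be realized in PyTorch. This is a minimal sketch under assumptions: the class `MaskedConv2d`, the mask layout (1 = visible, 0 = masked), and the nearest-neighbor downsampling rule are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Module):
    """Convolution for masked-image reconstruction (illustrative sketch).

    An ordinary convolution leaks visible pixels into masked regions
    wherever its kernel straddles a mask boundary; zeroing the input
    before the convolution and re-masking its output keeps masked
    positions free of visible information.
    """

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) features; mask: (B, 1, H, W), 1 = visible.
        x = x * mask          # hide masked patches from the kernel support
        x = self.conv(x)
        return x * mask       # re-mask outputs so no leaked content survives

def mask_pyramid(mask: torch.Tensor, num_stages: int) -> list:
    """Hierarchical masking (assumed form): reuse one patch-level mask at
    every stage by resampling it (nearest-neighbor) to each resolution."""
    masks = [mask]
    for _ in range(num_stages - 1):
        masks.append(F.interpolate(masks[-1], scale_factor=0.5))
    return masks
```

In this sketch, the re-masking after every convolution is what distinguishes the operation from a plain convolution on zeroed inputs, whose receptive fields would otherwise blend visible and masked content.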

Abstract:

Human pose estimation (HPE) relies on the anatomical relationships among different body parts to locate keypoints. Despite the significant progress achieved by convolutional neural network (CNN)-based models in HPE, they typically fail to explicitly learn the global dependencies among various body parts. To overcome this limitation, we propose a spatial-aware HPE model called SSpose that explicitly captures the spatial dependencies between specific keypoints and different locations in an image. The proposed SSpose model adopts a hybrid CNN-Transformer encoder to simultaneously capture local features and global dependencies. To better preserve image details, a multiscale fusion module is introduced to integrate coarse- and fine-grained image information. By establishing a connection with the activation maximization (AM) principle, the final attention layer of the Transformer aggregates contributions (i.e., attention scores) from all image positions and forms the maximum position in the keypoint heatmap.
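As a concrete reading of this AM connection, the sketch below aggregates the final attention layer's scores into keypoint heatmaps whose maxima give the predicted locations. The tensor shapes, the keypoint-query design, and the head-averaging step are assumptions for illustration; the paper's exact layout may differ.

```python
import torch

def heatmaps_from_attention(attn: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Aggregate final-layer attention scores into keypoint heatmaps.

    attn: (B, num_heads, K, N) attention of K keypoint queries over
          N = h * w image positions (shapes assumed for illustration).
    Returns: (B, K, h, w) heatmaps; each map's maximum marks a keypoint,
    mirroring the activation-maximization view of the attention layer.
    """
    scores = attn.mean(dim=1)                       # pool contributions over heads
    return scores.reshape(scores.size(0), -1, h, w)

def decode_keypoints(heatmaps: torch.Tensor) -> torch.Tensor:
    """Read off each keypoint as the argmax position of its heatmap."""
    b, k, h, w = heatmaps.shape
    flat = heatmaps.flatten(2).argmax(dim=-1)       # (B, K) flat indices
    return torch.stack((flat % w, flat // w), dim=-1)  # (B, K, 2) as (x, y)

# Usage with dummy attention scores: 17 COCO-style keypoints, 8 heads.
attn = torch.softmax(torch.randn(2, 8, 17, 64 * 48), dim=-1)
kps = decode_keypoints(heatmaps_from_attention(attn, 64, 48))  # (2, 17, 2)
```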
Published in: IEEE Transactions on Artificial Intelligence (Volume: 5, Issue: 11, November 2024)
Page(s): 5403-5417
Date of Publication: 08 August 2024
Electronic ISSN: 2691-4581
