Joint usage of global and local attentions in hourglass network for human pose estimation

doi:10.1016/j.neucom.2021.10.073

Neurocomputing

Volume 472, 1 February 2022, Pages 95-102

https://doi.org/10.1016/j.neucom.2021.10.073

Abstract

Human pose estimation is a challenging research task in the field of computer vision. Current mainstream methods have made great progress in pose estimation, but they still pay too little attention to the negative impact of background on human pose estimation. In this work, we propose a human pose estimation framework characterized by the joint usage of global and local attention modules in an hourglass backbone network. The global attention module aims to reduce the negative impact of background. The local attention module is designed to help refine each joint. We tested our method on two benchmark datasets for human pose estimation, and the experimental results show that the proposed model is superior to current mainstream algorithms.

Introduction

The purpose of human pose estimation is to locate human body joints, e.g. the head, hips, knees, and ankles, from images. Human pose estimation plays an important role in analyzing human behavior based on images or videos. Accurate and efficient human pose estimation may facilitate various applications such as human action recognition [5], [6], [40], person ReID [3], human–computer interaction [24], [35] and video object tracking [32]. However, due to the volatile camera view angle and complex human posture, human pose estimation remains a challenging task after decades of study.

One problem overlooked by current methods is the negative impact of background, which usually arises when the background is complex. In addition, the accuracy of pose estimation still needs to be improved. To tackle these two problems, a model could represent features at multiple scales. The higher-level features are computed with a large receptive field and thus can capture more global information about the image content, while the lower-level features are obtained with a small receptive field and contain fine-grained information about local regions. The Stacked Hourglass Network (SHN) [1] is a commonly used multi-scale feature extraction model in pose estimation, in which the higher-level layers focus on learning the overall human pose and the lower-level layers concentrate on fine-grained detection of local joints. Therefore, we believe the SHN is beneficial for solving the above two problems.
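The relation between network depth and receptive field can be made concrete with a short calculation. The sketch below uses the standard recurrence for stacked convolution and pooling layers; the layer configuration is hypothetical and is not taken from the paper.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) after each layer.

    `layers` is a list of (kernel_size, stride) pairs applied in order.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # distance in input pixels between adjacent outputs
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each extra kernel tap reaches `jump` pixels further
        jump *= stride             # striding spreads adjacent outputs apart in the input
        sizes.append(rf)
    return sizes


# Hypothetical encoder: a 3x3 convolution followed by 2x2 stride-2 pooling, three times.
example_layers = [(3, 1), (2, 2)] * 3
print(receptive_field(example_layers))  # [3, 4, 8, 10, 18, 22]
```

Each downsampling step doubles the growth rate of the receptive field, which is why the deepest, lowest-resolution features of an hourglass see the most global context.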

In this paper, we propose a novel framework named Global and Local Attention Strengthened Stacked Hourglass Network (GLASS-HN) for human pose estimation. This model is characterized by incorporating a Global Attention Module (GAM) and a Local Attention Module (LAM) into the backbone stacked network. The GAM is realized by a non-local attention map inserted before each hourglass block. The LAM is a self-attention scheme applied to each channel of the feature map obtained by convolving the output of an hourglass block with a distinct kernel. Since the non-local attention mechanism captures long-range dependencies directly by computing interactions between any two positions, it helps to distinguish a person from the background and thereby reduces the negative impact of background. We convolve the output of each hourglass block with different kernels so that, given properly labeled training samples, each channel can focus on a specific body joint. The LAM can therefore help to refine each joint of a person, and our experiments show that it achieves better results.
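To illustrate how non-local attention computes interactions between any two positions, here is a minimal sketch in plain Python. It is not the authors' GAM: the learned embedding convolutions of a full non-local block are omitted (identity embeddings are assumed), and the function names and toy features are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def non_local_attention(features):
    """Apply non-local attention to N position vectors of length C.

    `features` is the feature map flattened over space (N = H * W).
    Assumption: identity embeddings stand in for the learned projections.
    Returns (outputs, attention), where attention[i][j] is the weight
    position i places on position j.
    """
    # Pairwise similarity between every two positions, normalised per row.
    attention = [softmax([dot(q, k) for k in features]) for q in features]
    outputs = []
    for query, weights in zip(features, attention):
        # Weighted sum of all positions, regardless of spatial distance.
        aggregated = [
            sum(w * v[c] for w, v in zip(weights, features))
            for c in range(len(query))
        ]
        # Residual connection keeps the original feature.
        outputs.append([x + a for x, a in zip(query, aggregated)])
    return outputs, attention


# Toy input: positions 0 and 2 look like "person", 1 and 3 like "background".
toy = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0]]
_, att = non_local_attention(toy)
print([round(w, 3) for w in att[0]])
```

In this toy input, position 0 places almost all of its weight on itself and on position 2, which shares its appearance, and very little on the two background positions. This is the behaviour that lets such a module separate a person from the background.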

The contributions of our work are fourfold:

(1) We propose a new architecture for pose estimation, called Global and Local Attention Strengthened Stacked Hourglass Network (GLASS-HN), which jointly uses global and local attention mechanisms in the Hourglass Network.
(2) We propose the Global Attention Module (GAM). The global attention block is realized by a non-local attention map, which reduces the negative impact of background.
(3) We propose the Local Attention Module (LAM). The local attention block is realized by a channel-wise self-attention model, which helps to refine each joint.
(4) The experimental results show that GLASS-HN is superior to current mainstream algorithms on the MPII Human Pose and Leeds Sports Poses (LSP) datasets.
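The channel-wise local attention in contribution (3) might be sketched as follows. This is only an illustration under strong assumptions. The input is taken to be one response map per joint, already produced by convolving the hourglass output with a distinct kernel per channel, and each channel is reweighted by a softmax over its own spatial positions. The paper's actual LAM formulation is not given in this excerpt, and the function names are hypothetical.

```python
import math


def spatial_softmax(channel):
    """Softmax over all positions of one H x W map (list of rows)."""
    peak = max(v for row in channel for v in row)
    exps = [[math.exp(v - peak) for v in row] for row in channel]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def local_attention(joint_maps):
    """Reweight each joint channel by an attention map derived from itself.

    `joint_maps` is a list of K maps, one per body joint (assumed input).
    Strong responses are emphasised and weak ones suppressed, channel by
    channel, so each channel sharpens around its own joint.
    """
    refined = []
    for channel in joint_maps:
        attention = spatial_softmax(channel)
        refined.append([
            [value * weight for value, weight in zip(row, att_row)]
            for row, att_row in zip(channel, attention)
        ])
    return refined


# One hypothetical 2 x 2 joint channel with a clear peak at row 1, column 0.
toy_maps = [[[0.1, 0.2], [3.0, 0.3]]]
refined_maps = local_attention(toy_maps)
print(refined_maps[0])
```

After reweighting, the peak stands out more strongly relative to its neighbours than in the raw map, which is the refinement effect this sketch is meant to show.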

The rest of this paper is organized as follows: Section 2 introduces related work, Section 3 presents the proposed method in detail, Section 4 presents and analyzes the experimental results, and Section 5 concludes the paper.

Section snippets

Human pose estimation

Before the boom of deep learning, researchers sought accurate pose estimation from two directions, i.e. computing better feature representations and exploiting the spatial relationships between body joints. Therefore, traditional human pose estimation methods can be roughly divided into two categories. The first type treats the pose estimation problem as a classification or regression problem from global features [4] such as HOG, Shape Context, and SIFT. However, this

Model architecture

An overview of the proposed network is illustrated in Fig. 1. We adopt the highly modularized stacked Hourglass Network [1] as the basic network structure for human pose estimation. The stacked Hourglass Network contains eight hourglass blocks, each of which captures a multi-scale feature representation of the input signal. Before each hourglass block, we insert a Global Attention Module (GAM) into the network, aiming to learn the inter-relationship between two arbitrary image locations such that

Experiments

In this section, we first introduce the datasets and their setup in the experiments, and then compare our GLASS-HN network with a group of state-of-the-art methods to demonstrate the significance of the proposed model. Subsequently, we present ablation results for the different network components.

Conclusion

We have proposed the GLASS-HN framework, in which a Global Attention Module (GAM) and a Local Attention Module (LAM) are jointly used in a baseline hourglass network to explore global and local information for human pose estimation. The GAM uses non-local attention maps to distinguish a person from the background, thereby reducing the negative impact of background. The LAM has a better ability to refine each joint of a person. The experimental results prove the effectiveness of this

CRediT authorship contribution statement

Xiena Dong: Methodology, Validation, Investigation, Writing-original-draft. Jun Yu: Conceptualization, Formal-analysis, Visualization. Jian Zhang: Resources, Writing-review-editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant No. 61836002, Grant No. 62125201 and Grant No. 61972361.

References (48)

Z. Wang et al., Human skeleton mutual learning for person reidentification, Neurocomputing (2020)
X. Chen et al., Skeleton-based action recognition with extreme learning machines, Neurocomputing (2015)
J. Zhu et al., Convolutional relation network for skeleton-based action recognition, Neurocomputing (2019)
Y. Tian et al., Densely connected attentional pyramid residual network for human pose estimation, Neurocomputing (2019)
G. Zheng et al., Hierarchical structure correlation inference for pose estimation, Neurocomputing (2020)
B. Ma et al., VoD: a novel image representation for head yaw estimation, Neurocomputing (2015)
A. Newell et al., Stacked hourglass networks for human pose estimation
M. Andriluka et al., 2d human pose estimation: New benchmark and state of the art analysis
R. Urtasun et al., Sparse probabilistic regression for activity-independent human pose inference
X. Fan et al., Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation
J. Carreira et al., Human pose estimation with iterative error feedback
J. Tompson et al., Efficient object localization using convolutional networks
S. Wei et al., Convolutional pose machines
D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: ICLR,...
J. Ba et al., Multiple object recognition with visual attention
J. Yu et al., Generative Image Inpainting with Contextual Attention
A. Jain, J. Tompson, M. Andriluka, Graham W. Taylor, C. Bregler, Learning Human Pose Estimation Features with...
E. Chio, M. Bahadori, L. Song, W. Stewart, J. Sun, GRAM: Graph-Based Attention Model for Healthcare Representation...
Q. You et al., Image captioning with semantic attention, arXiv 1603.03925 (2016)
J. Chen, H. Zhang, X. He, L. Nie, W. Liu, T. Chua, Attentive Collaborative Filtering: Multimedia Recommendation with...
Z. Yang et al., Stacked attention networks for image question answering, arXiv 1511.02274 (2015)
J. Kuen et al., Recurrent attentional networks for saliency detection
X. Chu et al., Multi-context attention for human pose estimation
X. Wang et al., Non-local Neural Networks

Jun Yu (M-13) received the B.Eng. and Ph.D. degrees from Zhejiang University, Zhejiang, China. He was an Associate Professor with the School of Information Science and Technology, Xiamen University, Xiamen, China. From 2009 to 2011, he was with Nanyang Technological University, Singapore. From 2012 to 2013, he was a Visiting Researcher at Microsoft Research Asia (MSRA). He is currently a Professor with the School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China. He has authored or coauthored more than 100 scientific articles. Over the past years, his research interests have included multimedia analysis, machine learning, and image processing. In 2017, he received the IEEE SPS Best Paper Award. Dr. Yu has (co-)chaired several special sessions, invited sessions, and workshops. He served as a program committee member or reviewer of top conferences and prestigious journals. He is a Professional Member of the Association for Computing Machinery and the China Computer Federation.
