Elsevier

Neurocomputing

Volume 472, 1 February 2022, Pages 95-102

Joint usage of global and local attentions in hourglass network for human pose estimation

https://doi.org/10.1016/j.neucom.2021.10.073

Abstract

Human pose estimation is a challenging research task in computer vision. Current mainstream methods have made great progress, but they still pay too little attention to the negative impact of the background on pose estimation. In this work, we propose a human pose estimation framework characterized by the joint use of global and local attention modules in an hourglass backbone network. The global attention module aims to reduce the negative impact of the background, and the local attention module is designed to help refine each joint. We tested our method on two benchmark datasets for human pose estimation, and the experimental results show that the proposed model is superior to current mainstream algorithms.

Introduction

The purpose of human pose estimation is to locate human body joints, e.g. the head, hips, knees, and ankles, in images. Human pose estimation plays an important role in analyzing human behavior based on images or videos. Accurate and efficient human pose estimation may facilitate various applications such as human action recognition [5], [6], [40], person ReID [3], human–computer interaction [24], [35] and video object tracking [32]. However, due to varying camera viewpoints and complex human postures, human pose estimation remains a challenging task after decades of study.

One problem overlooked by current methods is the negative impact of the background, which usually arises when the background is complex. In addition, the accuracy of pose estimation still needs to be improved. To tackle these two problems, a model can represent features at multiple scales. Higher-level features are computed with a large receptive field and thus capture more global information about the image content, while lower-level features are obtained with a small receptive field and contain fine-grained information about local regions. The Stacked Hourglass Network (SHN) [1] is a widely used multi-scale feature extraction model in pose estimation, in which the higher-level layers focus on learning the overall human pose and the lower-level layers concentrate on fine-grained detection of local joints. Therefore, we believe the SHN is well suited to addressing both problems.
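The relation between depth and receptive field described above can be made concrete with the standard recurrence for stacked convolution and pooling layers: a layer with kernel size k widens the receptive field by (k - 1) times the accumulated stride of the layers before it. The following is a minimal sketch only; the function name and the layer stack are hypothetical and do not reproduce the configuration of the SHN [1].

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) after each layer.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1  # receptive field and accumulated stride at the input
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # widen by the kernel extent, in input pixels
        jump *= stride             # later layers step over more input pixels
        sizes.append(rf)
    return sizes


# Hypothetical stack (an assumption, not the SHN layout):
# 3x3 conv, 2x2 pool, 3x3 conv, 2x2 pool, 3x3 conv.
example_layers = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
print(receptive_field(example_layers))  # [3, 4, 8, 10, 18]
```

In this toy stack, the last layer sees an 18 x 18 input window while the first sees only 3 x 3, which is the sense in which deeper features are more global and shallower features more local.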

In this paper, we propose a novel framework named Global and Local Attention Strengthened Stacked Hourglass Network (GLASS-HN) for human pose estimation. The model incorporates a Global Attention Module (GAM) and a Local Attention Module (LAM) into the stacked hourglass backbone. The GAM is realized by a non-local attention map inserted before each hourglass block. The LAM is a self-attention scheme applied to each channel of the feature map obtained by convolving the output of an hourglass block with a distinct kernel. Since the non-local attention mechanism captures long-range dependencies directly by computing interactions between any two positions, it helps distinguish a person from the background and thereby reduces the negative impact of the background. We convolve the output of the hourglass block with different kernels so that, given properly labeled training samples, each channel can focus on a specific body joint. The LAM can therefore help refine each joint, and our experiments show that it yields better results.
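To illustrate what "computing interactions between any two positions" can look like, the sketch below implements a generic non-local operation in its embedded-Gaussian form: every position attends to every other position through a softmax over pairwise dot products, and the result is added back to the input. This is an assumption-laden illustration, not the authors' GAM: the function names, the projection matrices `w_theta`, `w_phi`, `w_g`, `w_out`, and the toy data are all hypothetical, and plain Python lists stand in for tensors.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def non_local_block(features, w_theta, w_phi, w_g, w_out):
    """Embedded-Gaussian non-local operation with a residual connection.

    `features` holds one C-dimensional vector per spatial position
    (an H x W map flattened to N = H * W rows). Returns the refined
    features and the N x N attention matrix.
    """
    queries = [matvec(w_theta, x) for x in features]
    keys = [matvec(w_phi, x) for x in features]
    values = [matvec(w_g, x) for x in features]

    attention, outputs = [], []
    for x, q in zip(features, queries):
        # Similarity of this position to every position in the map.
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
        attention.append(weights)
        # Aggregate values from all positions, weighted by similarity.
        context = [sum(w * v[d] for w, v in zip(weights, values))
                   for d in range(len(values[0]))]
        # Project the context and add the input back (residual).
        outputs.append([xi + yi for xi, yi in zip(x, matvec(w_out, context))])
    return outputs, attention


# Toy 2 x 2 map with C = 2 (made-up values): positions 0 and 1 are
# "person-like", positions 2 and 3 are "background-like".
identity = [[1.0, 0.0], [0.0, 1.0]]
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
refined, attn = non_local_block(toy, identity, identity, identity, identity)
```

With these toy inputs, position 0 gives more weight to the similar position 1 than to the dissimilar position 2, which is the kind of long-range grouping that could help separate a person from the background. In a trained network the projections would be learned rather than fixed to the identity.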

The contributions of our work are fourfold:

  • (1) We propose a new architecture for pose estimation, called Global and Local Attention Strengthened Stacked Hourglass Network (GLASS-HN), which jointly uses global and local attention mechanisms in the Hourglass Network.

  • (2) We propose the Global Attention Mechanism (GAM). The global attention block is realized by a non-local attention map, which reduces the negative impact of the background.

  • (3) We propose the Local Attention Mechanism (LAM). The local attention block is realized by a channel-wise self-attention model, which helps to refine each joint.

  • (4) The experimental results show that GLASS-HN is superior to current mainstream algorithms on the MPII Human Pose and Leeds Sports Poses (LSP) datasets.
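As a rough illustration of the channel-wise idea in contribution (3), the sketch below first produces one score map per joint with a 1x1 convolution (one kernel per joint) and then re-weights each map by a spatial softmax computed from that same map. The spatial softmax is a stand-in chosen for simplicity; the exact self-attention used in the LAM is not specified in this text, so the function names, kernels, and toy values are all assumptions.

```python
import math


def joint_channels(features, kernels):
    """1x1 convolution: one kernel (length-C weight vector) per joint.

    `features` is an H x W grid of C-dimensional vectors; the result is
    a list of K score maps, each an H x W grid of floats.
    """
    return [[[sum(w * v for w, v in zip(kernel, pixel)) for pixel in row]
             for row in features]
            for kernel in kernels]


def spatial_softmax(score_map):
    """Normalize one H x W map into attention weights that sum to 1."""
    peak = max(max(row) for row in score_map)
    exps = [[math.exp(s - peak) for s in row] for row in score_map]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def local_attention(features, kernels):
    """Re-weight each joint channel by its own spatial attention map."""
    refined = []
    for score_map in joint_channels(features, kernels):
        attn = spatial_softmax(score_map)
        refined.append([[s * a for s, a in zip(s_row, a_row)]
                        for s_row, a_row in zip(score_map, attn)])
    return refined


# Toy 2 x 2 feature map with C = 2 channels and K = 2 hypothetical joints.
toy_features = [[[2.0, 0.0], [0.5, 0.1]],
                [[0.1, 0.3], [0.0, 3.0]]]
toy_kernels = [[1.0, 0.0], [0.0, 1.0]]
refined_maps = local_attention(toy_features, toy_kernels)
```

Each channel keeps its strongest response and suppresses weaker ones, so the first map peaks at the top-left position and the second at the bottom-right. A trained model would learn the kernels from labeled joints instead of using the fixed ones here.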

The rest of this paper is organized as follows: Section 2 introduces related work, Section 3 presents the proposed method in detail, Section 4 reports and analyzes the experimental results, and Section 5 concludes the paper.

Section snippets

Human pose estimation

Before the boom of deep learning, researchers pursued accurate pose estimation from two aspects, i.e. computing better feature representations and exploiting the spatial relationships between body joints. Therefore, traditional human pose estimation methods can be roughly divided into two categories. The first type treats pose estimation as a classification or regression problem based on global features [4] such as HOG, Shape Context, and SIFT. However, this

Model architecture

An overview of the proposed network is illustrated in Fig. 1. We adopt the highly modularized stacked Hourglass Network [1] as the basic network structure for human pose estimation. The stacked Hourglass Network contains eight hourglass blocks, each of which captures a multi-scale feature representation of the input signal. Before each hourglass block, we insert a Global Attention Module (GAM) into the network, aiming to learn the inter-relationship between two arbitrary image locations such that

Experiments

In this section, we first introduce the datasets and the experimental setup, and then compare our GLASS-HN network with a group of state-of-the-art methods, which indicates the significance of the proposed model. Subsequently, we present ablation results for the different network components.

Conclusion

We have proposed the GLASS-HN framework, in which a Global Attention Module (GAM) and a Local Attention Module (LAM) are jointly used in a baseline hourglass network to exploit global and local information for human pose estimation. The GAM uses non-local attention maps to distinguish a person from the background, thereby reducing the negative impact of the background, and the LAM is better able to refine each joint of the person. The experimental results demonstrate the effectiveness of this

CRediT authorship contribution statement

Xiena Dong: Methodology, Validation, Investigation, Writing - original draft. Jun Yu: Conceptualization, Formal analysis, Visualization. Jian Zhang: Resources, Writing - review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant No. 61836002, Grant No. 62125201 and Grant No. 61972361.

References (48)

  • J. Carreira et al.

    Human pose estimation with iterative error feedback

  • J. Tompson et al.

    Efficient object localization using convolutional networks

  • S. Wei et al.

    Convolutional pose machines

  • D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: ICLR,...
  • J. Ba et al.

    Multiple object recognition with visual attention

  • J. Yu et al.

    Generative Image Inpainting with Contextual Attention

  • A. Jain, J. Tompson, M. Andriluka, G.W. Taylor, C. Bregler, Learning Human Pose Estimation Features with...
  • E. Choi, M. Bahadori, L. Song, W. Stewart, J. Sun, GRAM: Graph-Based Attention Model for Healthcare Representation...
  • Q. You et al.

    Image captioning with semantic attention

    arXiv 1603.03925

    (2016)
  • J. Chen, H. Zhang, X. He, L. Nie, W. Liu, T. Chua, Attentive Collaborative Filtering: Multimedia Recommendation with...
  • Z. Yang et al.

    Stacked attention networks for image question answering

    arXiv 1511.02274

    (2015)
  • J. Kuen et al.

    Recurrent attentional networks for saliency detection

  • X. Chu et al.

    Multi-context attention for human pose estimation

  • X. Wang et al.

    Non-local Neural Networks

Jun Yu (M-13) received the B.Eng. and Ph.D. degrees from Zhejiang University, Zhejiang, China. He was an Associate Professor with the School of Information Science and Technology, Xiamen University, Xiamen, China. From 2009 to 2011, he was with Nanyang Technological University, Singapore. From 2012 to 2013, he was a Visiting Researcher at Microsoft Research Asia (MSRA). He is currently a Professor with the School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China. He has authored or coauthored more than 100 scientific articles. Over the past years, his research interests have included multimedia analysis, machine learning, and image processing. In 2017, he received the IEEE SPS Best Paper Award. Dr. Yu has (co-)chaired several special sessions, invited sessions, and workshops. He served as a program committee member or reviewer of top conferences and prestigious journals. He is a Professional Member of the Association for Computing Machinery and the China Computer Federation.