Elsevier

Digital Signal Processing

Volume 129, September 2022, 103681
Adaptive data augmentation network for human pose estimation

https://doi.org/10.1016/j.dsp.2022.103681

Highlights

  • The diversity of training samples will enhance the performance of human pose estimation.

  • The number of samples pasted onto the original image affects network performance, so an appropriate number must be chosen.

  • Pasting the complete person is beneficial in enhancing the original image information.

  • The generative adversarial network brings the synthetic image closer to real challenging cases.

Abstract

With the rapid development of convolutional neural networks (CNNs), the performance of human pose estimation has improved significantly. However, state-of-the-art methods still face specific challenges, such as occluded keypoints and nearby persons. Unlike the classical strategy of improving the network itself, our approach uses data augmentation to produce adaptive training data that yield more accurate keypoints. In this paper, we propose the Adaptive Data Augmentation Network (ADA-Net), which produces more adaptive training data by adding occlusion and interleaving information to the original image. First, we introduce the Active Transmission Network (ATNet), which actively learns a transformation matrix from the original image and uses it to synthesize new images for training. Second, we adopt an adversarial training strategy combined with ATNet that allows us to capture more challenging cases. Extensive experiments show that our approach achieves results comparable to or better than most state-of-the-art methods. In particular, ADA-Net outperforms the High-Resolution Network (HRNet) by 1.1 and 0.7 points on the COCO test-dev set and the MPII validation set, respectively.

Introduction

Human pose estimation identifies and locates the keypoints of all persons in a still image. It is a fundamental technique for numerous visual applications, including human motion analysis [43], human-computer interaction [17], and animation [23]. With the rapid development of deep convolutional neural networks (DCNNs) [40], [21], human pose estimation has made significant progress, with clear breakthroughs on standard benchmark datasets such as MS COCO [28] and MPII Human Pose [1]. However, many challenging cases remain, including occluded keypoints [14] and nearby persons [3], in which keypoints cannot be localized well.

Insufficient training data can degrade the performance of DCNN-based methods for keypoint localization, especially when challenging cases are absent from the training set. For example, in [40], samples with occluded keypoints make it hard to train the High-Resolution Network (HRNet) for accurate keypoint localization. In addition, annotating keypoint locations manually is costly.

A common solution is data augmentation, which improves a learning model by generating additional samples. However, it requires practitioners to design policies that capture prior knowledge of the corresponding domain [12]. Traditionally, data augmentation employs global image transformations (e.g., scaling, shifting, rotating, cropping, flipping, or color dithering), as shown in Fig. 1. Although these transformations enrich the information within the training images, they are of little help in solving challenging cases [3]. Generating occluded data [21], [10] can be regarded as an effective tool. Cheng et al. [10] introduce a Cylinder Man Model to generate occlusion labels, which offers little guidance for 2D data augmentation. For 2D human pose estimation, Ke et al. [21] propose a Keypoint Masking Training strategy that enhances the information by copying a background patch and placing it onto a keypoint. While this method can simulate occluded data, it cannot significantly improve the final performance. The reasons are twofold: 1) the training image only accounts for occluded keypoints without providing extra information to learn from; 2) the pasted patch that occludes the keypoints does not help the network learn from samples of nearby persons, and such occlusion is not consistent with real scenes.
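The patch-based occlusion idea described above (copying a patch and placing it onto a keypoint) can be illustrated with a minimal sketch. This is not the implementation from [21]: the function name `mask_keypoint`, the `patch` half-size, and the `source_xy` argument are hypothetical, the image is a plain list of rows to stay dependency-free, and the source patch is drawn from anywhere in the image because the sketch has no notion of which pixels are background.

```python
import random


def mask_keypoint(image, keypoint, patch=1, source_xy=None, rng=random):
    """Occlude `keypoint` by copying a square patch from elsewhere in `image`.

    image:     list of rows, each a list of pixel values (modified in place).
    keypoint:  (x, y) integer pixel coordinates of the keypoint to hide.
    patch:     half-size of the square patch (hypothetical parameter).
    source_xy: top-left corner of the source patch; random if None.
    """
    h, w = len(image), len(image[0])
    size = 2 * patch + 1
    if source_xy is None:
        # Assumption: any location may serve as the source. A real pipeline
        # would restrict this to background pixels.
        source_xy = (rng.randrange(w - size + 1), rng.randrange(h - size + 1))
    sx, sy = source_xy
    # Copy the source patch first so an overlapping destination cannot alter it.
    src = [row[sx:sx + size] for row in image[sy:sy + size]]
    kx, ky = keypoint
    for dy in range(size):
        for dx in range(size):
            x, y = kx - patch + dx, ky - patch + dy
            if 0 <= x < w and 0 <= y < h:  # clip at the image border
                image[y][x] = src[dy][dx]
    return image
```

The sketch makes the first limitation visible: the pasted patch only removes evidence around one keypoint and adds no new person-related content to the image.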

To address realistic challenging cases, we propose the Adaptive Data Augmentation Network (ADA-Net), as shown in Fig. 2. First, we build a pool of complete persons. During training, a complete person is randomly selected and pasted into the original image so that the network can learn additional information. On the one hand, the complete person can easily overlap the persons in the original image at different positions, thus producing more diverse training data. On the other hand, real scenes contain complete persons, so adding a complete person to the original image is consistent with real scenes. From this perspective, our approach is practical.
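A minimal sketch of the paste step described above, under stated assumptions: each pool entry is taken to be a `(person, mask)` pair where `mask` marks the person's pixels, images are plain lists of rows, and the paste position is drawn at random purely as a placeholder. The names `paste_person` and `augment` are hypothetical, and the sketch does not model how position and shape are actually chosen.

```python
import random


def paste_person(image, person, mask, top_left):
    """Return a copy of `image` with `person` composited where `mask` is set."""
    out = [row[:] for row in image]  # leave the original image untouched
    h, w = len(out), len(out[0])
    ox, oy = top_left
    for dy, (prow, mrow) in enumerate(zip(person, mask)):
        for dx, (pixel, m) in enumerate(zip(prow, mrow)):
            x, y = ox + dx, oy + dy
            # Only person pixels inside the image overwrite the original.
            if m and 0 <= x < w and 0 <= y < h:
                out[y][x] = pixel
    return out


def augment(image, pool, rng=random):
    """Randomly select one (person, mask) pair from `pool` and paste it.

    The random position is a stand-in (assumption) for a learned placement.
    """
    person, mask = rng.choice(pool)
    h, w = len(image), len(image[0])
    top_left = (rng.randrange(w), rng.randrange(h))
    return paste_person(image, person, mask, top_left)
```

Because only masked pixels are written, the pasted person occludes whatever lies beneath it while the rest of the original image is preserved, which is how overlap with existing persons arises.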

Then, ADA-Net applies the Active Transmission Network (ATNet) to add the complete person to the original image while changing its shape and position. Specifically, ATNet is an active approach that learns parameters from each original image to form different transformation matrices. The transformation matrix determines the shape and position of the complete person so that it blends better into the original image to produce a synthetic image. Furthermore, ATNet acts as a generator that learns the transformation matrix that makes the synthetic image as close as possible to real challenging cases. Finally, when a pose estimation network is used as a discriminator for adversarial learning, ATNet can generate the most confusing training samples from the original image and thus synthesize more challenging cases. In this way, our method obtains more adaptive training data by adding occlusion and interleaving information to the original image. Our contributions are summarized as follows:

  • We build a pool of multiple complete persons. Exploring the appropriate number of complete persons within the pool is beneficial in enhancing the original image information and improving the overall performance of our method.

  • We propose a novel network called ADA-Net. It exploits ATNet as a generator to actively learn the adaptive transformation matrix of the synthetic image, while using human pose estimation as a discriminator to bring the synthetic image closer to the real challenging cases.

  • We evaluate our approach on both the COCO and MPII benchmarks. Experimental results show that our method is more effective than most state-of-the-art approaches.
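To make the role of a transformation matrix concrete, the sketch below builds a 2x3 affine matrix from a scale, a rotation angle, and a translation, and applies it to points such as the corners of a pasted person's bounding box. This parameterization is an assumption chosen for illustration: the text above does not specify the exact form of the matrix ATNet learns, and here the parameters are supplied by hand rather than predicted by a network. The function names are hypothetical.

```python
import math


def affine_matrix(scale, angle_deg, tx, ty):
    """Build a 2x3 matrix: uniform scale and rotation, followed by translation."""
    a = math.radians(angle_deg)
    c, s = scale * math.cos(a), scale * math.sin(a)
    return [[c, -s, tx],
            [s, c, ty]]


def transform_points(matrix, points):
    """Apply a 2x3 affine matrix to a list of (x, y) points."""
    (a, b, tx), (c, d, ty) = matrix
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


# Usage: shrink a 10x20 box to half size and shift it to (30, 40).
corners = [(0, 0), (10, 0), (10, 20), (0, 20)]
moved = transform_points(affine_matrix(0.5, 0.0, 30, 40), corners)
```

Varying the four parameters per image yields a different shape and position for the pasted person each time, which is the degree of freedom a generator network could control.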

Related work

Due to the high demand for real-life applications, multi-person pose estimation has become increasingly popular. To introduce the idea of this paper, we analyze relevant work from the following aspects.

Proposed method

As described in [35], adding extra data and annotations is a safe way to address the misjudgments of the current model on challenging cases, but doing so is costly [12]. The main objective of our paper is to show how to further improve the performance of existing human pose estimation models. The entire process of our method, shown in Fig. 2, is described in detail below.

ADA-Net makes three improvements. First, a paste part pool of multiple

Experiment

Our overall pipeline follows the top-down strategy for human pose estimation, as shown in Fig. 2. For joint training, we use ATNet as a generator to obtain the synthetic image and HRNet [40] as a discriminator, chosen for the excellent performance of its multi-scale fusion. This section verifies our contributions with experimental results.
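The generator and discriminator roles in this joint training can be written as a min-max objective. The formula below is only a sketch in our own notation, following the usual form of adversarial data augmentation, and is not the loss given in the paper: $G_{\phi}$ stands for ATNet producing a transformation matrix, $D_{\theta}$ for the pose network (HRNet here), $S$ for the synthesis step that pastes a pooled person $p$ into image $x$, $y$ for the keypoint annotation, and $\mathcal{L}_{\text{pose}}$ for a pose loss such as heatmap regression. All of these symbols are assumptions.

```latex
\min_{\theta} \max_{\phi} \;
\mathbb{E}_{(x, y)} \Big[ \mathcal{L}_{\text{pose}}\big( D_{\theta}(\tilde{x}),\, y \big) \Big],
\qquad
\tilde{x} = S\big(x,\, p,\, G_{\phi}(x)\big)
```

Under this reading, the generator is rewarded for synthetic images the pose network finds hard, while the pose network is trained to remain accurate on them.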

Conclusions

We propose the Adaptive Data Augmentation Network (ADA-Net) for human pose estimation. Based on adversarial learning, it synthesizes challenging cases by pasting samples from a Paste Part Pool into the original image, which helps handle occluded keypoints and nearby persons in real scenes. An Active Transmission Network (ATNet) is designed as a generator, which is combined with the Paste Part Pool to obtain the synthetic image. Furthermore, we use an existing human pose network as a discriminator for adversarial

CRediT authorship contribution statement

Dong Wang: Conceptualization, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft. Wenjun Xie: Resources, Validation. Youcheng Cai: Software, Writing – review & editing. Xiaoping Liu: Conceptualization, Funding acquisition, Project administration, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported in part by the National Key Research and Development Program of China under Grant 2020YFC1523100 and in part by the National Natural Science Foundation of China under Grant 61877016.

References (56)

  • M. Andriluka et al.

    2d human pose estimation: new benchmark and state of the art analysis

  • A. Antoniou et al.

    Data augmentation generative adversarial networks

  • Y. Bin et al.

    Adversarial semantic data augmentation for human pose estimation

  • Y. Cai et al.

    Learning delicate local representations for multi-person pose estimation

  • Z. Cao et al.

    Realtime multi-person 2d pose estimation using part affinity fields

  • Y. Chen et al.

    Adversarial posenet: a structure-aware convolutional network for human pose estimation

  • Y. Chen et al.

    Cascaded pyramid network for multi-person pose estimation

  • B. Cheng et al.

    Higherhrnet: scale-aware representation learning for bottom-up human pose estimation

  • Y. Cheng et al.

    Occlusion-aware networks for 3d human pose estimation in video

  • C.J. Chou et al.

    Self adversarial training for human pose estimation

  • E.D. Cubuk et al.

    Randaugment: practical automated data augmentation with a reduced search space

  • T. DeVries et al.

    Improved regularization of convolutional neural networks with cutout

  • D. Dwibedi et al.

    Cut, paste and learn: surprisingly easy synthesis for instance detection

  • R. Girshick et al.

    Detectron

    (2018)

  • W. Guo

    Multi-person pose estimation in complex physical interactions

  • K. He et al.

    Mask r-cnn

  • J. Huang et al.

    The devil is in the details: delving into unbiased data processing for human pose estimation

  • M. Jaderberg et al.

    Spatial transformer networks

    Adv. Neural Inf. Process. Syst.

    (2015)

    Dong Wang received the B.S. degree in electronic information engineering from Anhui Normal University, Wuhu, China, in 2016, and the M.S. degree in electronic and communication engineering from the Hefei University of Technology, Hefei, China, in 2019. He is currently pursuing the Ph.D. degree at the Hefei University of Technology. His current research interests include 2D/3D human pose estimation, computer vision, and machine learning.

    Wenjun Xie received his B.Sc. degree in 2006, M.Sc. degree in 2010, and Ph.D. degree in 2016, all from the Hefei University of Technology. He is now an Experimentalist at the Hefei University of Technology. His main research interests include human motion capture, motion synthesis, and natural interaction.

    Youcheng Cai received the B.S. degree in information and computing science from the Hefei University of Technology, Anhui, China, in 2008, where he is currently pursuing the Ph.D. degree in computer science. His research interests include 3D reconstruction, computer vision, and machine learning.

    Xiaoping Liu received the master's and Ph.D. degrees in computer science from the Hefei University of Technology, Hefei, China. He is currently a Professor with the School of Computer and Information, Hefei University of Technology. His research interests include 3D reconstruction, computer animation, and cooperative computing.