Pay Attention Selectively and Comprehensively: Pyramid Gating Network for Human Pose Estimation without Pre-training

Published: 12 October 2020


Deep neural networks with multi-scale feature fusion have achieved great success in human pose estimation. However, these methods still have drawbacks: 1) they treat multi-scale features equally, which may over-emphasize redundant features; 2) by preferring deeper structures, they learn features with strong semantic representation but tend to lose natural discriminative information; 3) to attain good performance, they rely heavily on pre-training, which is time-consuming or even unavailable in practice. To mitigate these problems, we propose a novel comprehensive recalibration model called Pyramid GAting Network (PGA-Net), which is capable of distilling, selecting, and fusing discriminative and attention-aware features at different scales and different levels (i.e., both semantic and natural levels). Moreover, by fusing features both selectively and comprehensively, PGA-Net demonstrates remarkable stability and encouraging performance even without pre-training, allowing the model to be trained truly from scratch. We demonstrate the effectiveness of PGA-Net by validating it on the COCO and MPII benchmarks, attaining new state-of-the-art performance.

In this work, we develop a novel framework called Pyramid Gating Network. First, we design a multi-stage residual feature pyramid gating strategy that aims to train a very deep network end-to-end. In this strategy, we learn soft gates on multi-scale features in the top-down structure, so that significant features are distilled and selected automatically and dynamically. Second, we propose an image pyramid attention that aims to preserve more natural information for fusion with semantic features. Third, we devise an effective incorporation framework that combines the two pyramid gating strategies (i.e., natural and semantic) at multiple scales. Importantly, with such reinforced and discriminative features, our model demonstrates remarkably more stable performance and much faster convergence even without pre-training, yielding a model that can be truly trained from scratch end-to-end. We also note that our method can be readily applied to other models.
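
The idea of soft gates on multi-scale features in a top-down structure can be illustrated with a minimal sketch. This is not the PGA-Net implementation: the text above does not specify the gate's form, so the sketch assumes a single scalar gate per pyramid level, computed as a sigmoid of the level's global mean, and uses hypothetical names (`gated_top_down`, `weights`, `biases`) and single-channel maps stored as nested lists.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def global_mean(fmap):
    """Mean over all spatial positions of a 2D map."""
    return sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))


def gated_top_down(pyramid, weights, biases):
    """Fuse a coarse-to-fine pyramid with one soft gate per finer level.

    Assumption (not from the paper): gate = sigmoid(w * mean(lateral) + b),
    and the fused map is gate * lateral + upsampled coarser map.
    Each level must be exactly twice the size of the previous one.
    """
    fused = [pyramid[0]]
    gates = []
    for lateral, w, b in zip(pyramid[1:], weights, biases):
        gate = sigmoid(w * global_mean(lateral) + b)  # soft gate in (0, 1)
        top = upsample2x(fused[-1])                   # top-down pathway
        merged = [
            [gate * l + t for l, t in zip(l_row, t_row)]
            for l_row, t_row in zip(lateral, top)
        ]
        fused.append(merged)
        gates.append(gate)
    return fused, gates


# Toy three-level pyramid: 1x1, 2x2 and 4x4 single-channel maps.
coarse = [[1.0]]
mid = [[0.0, 2.0], [2.0, 0.0]]
fine = [[1.0] * 4 for _ in range(4)]

# Zero weights and biases give a gate of exactly 0.5 at every level.
fused, gates = gated_top_down(
    [coarse, mid, fine], weights=[0.0, 0.0], biases=[0.0, 0.0]
)
```

In a trained network the gate parameters would be learned, so a level carrying redundant features could be scaled down toward zero while an informative level passes through nearly unchanged. A per-channel or per-pixel gate would follow the same pattern with a vector or map in place of the scalar.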


