DOI: 10.1145/3394171.3414041
research-article

Pay Attention Selectively and Comprehensively: Pyramid Gating Network for Human Pose Estimation without Pre-training

Published: 12 October 2020

Abstract

Deep neural networks with multi-scale feature fusion have achieved great success in human pose estimation. However, these methods still have drawbacks: 1) they treat multi-scale features equally, which may over-emphasize redundant features; 2) by preferring deeper structures, they learn features with strong semantic representation but tend to lose natural discriminative information; 3) to attain good performance, they rely heavily on pre-training, which is time-consuming and sometimes unavailable in practice. To mitigate these problems, we propose a novel comprehensive recalibration model called Pyramid GAting Network (PGA-Net), which is capable of distilling, selecting, and fusing discriminative and attention-aware features at different scales and different levels (i.e., both semantic and natural levels). Meanwhile, by fusing features both selectively and comprehensively, PGA-Net demonstrates remarkable stability and encouraging performance even without pre-training, so that the model can be trained truly from scratch. We demonstrate the effectiveness of PGA-Net by validating it on the COCO and MPII benchmarks, attaining new state-of-the-art performance. https://github.com/ssr0512/PGA-Net

Supplementary Material

MP4 File (3394171.3414041.mp4)
In this work, we develop a novel framework called Pyramid Gating Network. First, we design a multi-stage residual feature pyramid gating strategy that aims to train a very deep network end-to-end. Moreover, we learn soft gates on multi-scale features in the top-down structure, enabling the network to distill and select significant features automatically and dynamically. Second, we propose an image pyramid attention that aims to preserve more natural information for fusion with semantic features. Third, we devise an effective incorporation framework that combines the two pyramid gating strategies (i.e., natural and semantic) at multiple scales. Importantly, with such reinforced and discriminative features, our model demonstrates remarkably more stable performance and much faster convergence even without pre-training, yielding a model that can truly be trained from scratch end-to-end. Our method can also be readily applied to other models.
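
The idea of soft gates on multi-scale features in a top-down structure can be illustrated with a small toy sketch. This is not the authors' implementation (see the repository linked in the abstract for that); it only shows the general pattern of scaling each pyramid level by a gate in (0, 1) before merging it with the upsampled coarser level. All names here (`soft_gate`, `gated_top_down`, the scalar `weight` and `bias`) are hypothetical, and the scalar parameters stand in for whatever learned layers a real network would use.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D map stored as a list of rows."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def soft_gate(fmap, weight, bias):
    """Multiply every activation by a gate in (0, 1).

    `weight` and `bias` are hypothetical scalars standing in for learned
    gate parameters; a real model would predict gates with trained layers.
    """
    return [[v * sigmoid(weight * v + bias) for v in row] for row in fmap]


def gated_top_down(pyramid, gate_params):
    """Fuse a feature pyramid from coarse to fine.

    pyramid: list of 2D maps, coarsest first, each level twice the size
             of the previous one.
    gate_params: one (weight, bias) pair per level.
    """
    fused = soft_gate(pyramid[0], *gate_params[0])
    for fmap, (w, b) in zip(pyramid[1:], gate_params[1:]):
        up = upsample2x(fused)            # bring coarse result to this scale
        gated = soft_gate(fmap, w, b)     # softly select this level's features
        fused = [[u + g for u, g in zip(up_row, g_row)]
                 for up_row, g_row in zip(up, gated)]
    return fused


# Toy usage: a 1x1 coarse map and a 2x2 fine map.
coarse = [[1.0]]
fine = [[0.0, 2.0], [4.0, -2.0]]
fused = gated_top_down([coarse, fine], [(0.0, 0.0), (1.0, 0.0)])
```

Because the gates are continuous rather than hard 0/1 switches, a level is attenuated instead of dropped, which is what makes the selection "soft" in this sketch.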


Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

1. human pose estimation
2. pyramid gating system
3. stabilization

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
