Abstract
Human pose estimation from a single image is a fundamental yet challenging task in computer vision. Most existing methods, such as Hourglass, HRNet, and their variants, gradually generate multi-resolution representations from high resolution to low resolution, then recover a higher-resolution representation from the low-resolution one and use it to produce the final pose heatmaps. In this paper, we propose a novel architecture, the fixed-resolution representation network for human pose estimation, which maintains a fixed resolution throughout the whole process to preserve rich spatial-structural information. First, we propose an Improved Pyramid Convolutional Bottleneck (IPCB) that encodes feature maps with multiple receptive fields at the same resolution. Second, we introduce an efficient channel attention mechanism to enhance the feature extraction and information selection capability of IPCB. Third, to address the deviation introduced by the flip test at inference, we adopt an existing technique, Unbiased Data Processing. Fourth, owing to the change in model structure and limited computing resources, we introduce an iterative retraining strategy to solve the pre-training problem. We empirically demonstrate the effectiveness of our method: with 1.7M parameters and 3G FLOPs, it achieves competitive performance compared with state-of-the-art methods, scoring 89.5 (PCKh@0.5) on the MPII benchmark and 92.7 (PCK@0.2) on the LSP keypoint detection benchmark.
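
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how a PCK-style score is commonly computed: a predicted joint counts as correct when its distance to the ground truth is at most a threshold times a per-person reference length (the head segment length for PCKh@0.5, and commonly the torso size for PCK@0.2). The function name `pck`, its argument layout, and the toy data are illustrative assumptions, not the evaluation code used in the paper.

```python
import numpy as np

def pck(pred, gt, norm_len, thresh=0.5, visible=None):
    """Percentage of joints predicted within thresh * norm_len of the ground truth.

    pred, gt : arrays of shape (N, K, 2), N samples, K joints, (x, y) in pixels.
    norm_len : array of shape (N,), per-sample reference length
               (assumed: head size for PCKh, torso size for PCK).
    visible  : optional bool array of shape (N, K); False joints are ignored.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    norm_len = np.asarray(norm_len, dtype=float)

    # Euclidean distance per joint, normalized by each sample's reference length.
    dist = np.linalg.norm(pred - gt, axis=-1)   # shape (N, K)
    normalized = dist / norm_len[:, None]       # shape (N, K)

    correct = normalized <= thresh
    if visible is None:
        visible = np.ones_like(correct, dtype=bool)
    visible = np.asarray(visible, dtype=bool)

    # Average only over the joints that are marked visible.
    return 100.0 * correct[visible].mean()

# Toy example (hypothetical data): one person, two joints, reference length 20 px.
gt = [[[0.0, 0.0], [10.0, 0.0]]]
pred = [[[3.0, 4.0], [10.0, 20.0]]]   # joint errors: 5 px and 20 px
score = pck(pred, gt, norm_len=[20.0], thresh=0.5)
```

In the toy example the first joint is 5 px off (0.25 of the reference length) and the second is 20 px off (1.0 of it), so only the first counts as correct at a threshold of 0.5.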

Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Communicated by I. Bartolini.
About this article
Cite this article
Liu, Y., Hou, X. Fixed-resolution representation network for human pose estimation. Multimedia Systems 28, 1597–1609 (2022). https://doi.org/10.1007/s00530-022-00919-5
DOI: https://doi.org/10.1007/s00530-022-00919-5