Position Puzzle Network and Augmentation: localizing human keypoints beyond the bounding box

  • Original Paper
  • Published in: Machine Vision and Applications

Abstract

When estimating the pose of a person from a partial image, we humans do not confine our estimation to the given image: we can readily localize keypoints outside of it by referring to visual clues such as body size. Computational methods for human pose estimation, however, ignore such keypoints and focus only on the bounded area of the given image. In this paper, we propose a neural network and a data augmentation method that extend the range of human pose estimation beyond the bounding box. The Position Puzzle Network expands the spatial range of keypoint localization by refining the position and size of the target's bounding box, while Position Puzzle Augmentation enables the keypoint detector to estimate keypoints not only within, but also beyond, the input image. Using a cropped-image dataset prepared for proper evaluation, we show that the proposed method improves baseline keypoint detectors by 39.5% in mAP and 30.5% in mAR on average by enabling the localization of keypoints outside the bounding box. Additionally, we verify that the proposed method does not degrade performance on the original benchmarks and instead improves it by alleviating false-positive errors.
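The augmentation idea described in the abstract can be illustrated with a minimal sketch. This is our own reconstruction, not the authors' implementation: the function name, the shrink parameter, and the cropping scheme are all assumptions. The key point it demonstrates is that keypoints falling outside the crop are kept as supervision targets, with normalized coordinates allowed to leave the [0, 1] range, rather than being discarded.

```python
import numpy as np

def puzzle_crop(image, keypoints, rng, max_shrink=0.4):
    """Hypothetical sketch of a Position-Puzzle-style augmentation:
    crop a random sub-window of the person box so that some keypoints
    may fall OUTSIDE the crop, but keep every keypoint as a target in
    crop-normalized coordinates (values may leave [0, 1])."""
    h, w = image.shape[:2]
    # randomly shrink the window from each side
    x0 = int(rng.uniform(0, max_shrink) * w)
    y0 = int(rng.uniform(0, max_shrink) * h)
    x1 = w - int(rng.uniform(0, max_shrink) * w)
    y1 = h - int(rng.uniform(0, max_shrink) * h)
    crop = image[y0:y1, x0:x1]
    # express every keypoint relative to the crop; joints outside the
    # crop simply get coordinates outside [0, 1] instead of being dropped
    kps = (keypoints - np.array([x0, y0], dtype=float)) / np.array(
        [x1 - x0, y1 - y0], dtype=float
    )
    return crop, kps
```

A detector trained on such pairs sees targets beyond the image border, which is what lets it regress keypoints outside the bounding box at test time.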



Acknowledgements

This research was supported by the Ministry of Culture, Sports and Tourism and the Korea Creative Content Agency (Project Number: R2020070002).

Author information

Corresponding author

Correspondence to Jinah Park.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary materials containing additional explanations and visualizations (10,674 KB).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Park, S., Park, J. Position Puzzle Network and Augmentation: localizing human keypoints beyond the bounding box. Machine Vision and Applications 34, 129 (2023). https://doi.org/10.1007/s00138-023-01471-6

