Position Puzzle Network and Augmentation: localizing human keypoints beyond the bounding box

  • Original Paper
  • Published in: Machine Vision and Applications

Abstract

When estimating the pose of a person from a partial image, we humans do not confine our estimation to the given image: we can readily localize keypoints outside of it by referring to visual clues such as body size. Computational methods for human pose estimation, however, ignore such keypoints and focus only on the bounded area of the given image. In this paper, we propose a neural network and a data augmentation method that extend the range of human pose estimation beyond the bounding box. The Position Puzzle Network expands the spatial range of keypoint localization by refining the position and size of the target's bounding box, while Position Puzzle Augmentation enables the keypoint detector to estimate keypoints not only within, but also beyond, the input image. Using a cropped-image dataset prepared for proper evaluation, we show that the proposed method improves baseline keypoint detectors by 39.5% in mAP and 30.5% in mAR on average by enabling the localization of keypoints outside the bounding box. Additionally, we verify that the proposed method does not degrade performance on the original benchmarks and instead improves it by alleviating false-positive errors.
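The augmentation idea described in the abstract can be illustrated with a minimal sketch. This is our own reconstruction, not the authors' implementation: the function name, the shrink parameter, and the cropping scheme are all assumptions. The key point it demonstrates is that keypoints falling outside the crop are kept as supervision targets, with normalized coordinates allowed to leave the [0, 1] range, rather than being discarded.

```python
import numpy as np

def puzzle_crop(image, keypoints, rng, max_shrink=0.4):
    """Hypothetical sketch of a Position-Puzzle-style augmentation:
    crop a random sub-window of the person box so that some keypoints
    may fall OUTSIDE the crop, but keep every keypoint as a target in
    crop-normalized coordinates (values may leave [0, 1])."""
    h, w = image.shape[:2]
    # randomly shrink the window from each side
    x0 = int(rng.uniform(0, max_shrink) * w)
    y0 = int(rng.uniform(0, max_shrink) * h)
    x1 = w - int(rng.uniform(0, max_shrink) * w)
    y1 = h - int(rng.uniform(0, max_shrink) * h)
    crop = image[y0:y1, x0:x1]
    # express every keypoint relative to the crop; joints outside the
    # crop simply get coordinates outside [0, 1] instead of being dropped
    kps = (keypoints - np.array([x0, y0], dtype=float)) / np.array(
        [x1 - x0, y1 - y0], dtype=float
    )
    return crop, kps
```

A detector trained on such pairs sees targets beyond the image border, which is what lets it regress keypoints outside the bounding box at test time.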



Acknowledgements

This research was supported by the Ministry of Culture, Sports and Tourism and the Korea Creative Content Agency (Project Number: R2020070002).

Author information

Corresponding author

Correspondence to Jinah Park.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary materials containing additional explanations and visualizations (10,674 KB).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Park, S., Park, J. Position Puzzle Network and Augmentation: localizing human keypoints beyond the bounding box. Machine Vision and Applications 34, 129 (2023). https://doi.org/10.1007/s00138-023-01471-6

