
Self-supervised Siamese keypoint inference network for human pose estimation and tracking

Research · Published in Machine Vision and Applications

Abstract

Human pose estimation and tracking are important tasks for understanding human behavior. They currently face two challenges: missed detections caused by the sparse annotation of video datasets, and the difficulty of associating partially occluded and unoccluded instances of the same person. To address these challenges, we propose a self-supervised learning-based method that infers correspondences between keypoints to associate persons across video frames. Specifically, we propose a bounding box recovery module to recover missed detections and a Siamese keypoint inference network to resolve the mismatches caused by occlusion. A local–global attention module, designed within the Siamese keypoint inference network, learns the varying dependencies among human keypoints between frames. To simulate occlusion, we mask random pixels in the image and pre-train with knowledge distillation so that differently occluded instances of the same person are associated. Our method outperforms state-of-the-art methods for human pose estimation and tracking on the PoseTrack 2018 and PoseTrack 2021 datasets. Code is available at: https://github.com/yhtian2023/SKITrack.
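
As a concrete illustration of the occlusion-simulation step described above, the sketch below pairs a clean person crop (seen by a teacher encoder) with a randomly pixel-masked copy (seen by the student). This is a minimal sketch assuming a PyTorch setup; the function names, the per-pixel masking granularity, and the cosine-similarity distillation objective are our assumptions rather than the authors' released implementation, which is available in the linked repository.

    import torch
    import torch.nn.functional as F

    def mask_random_pixels(image: torch.Tensor, mask_ratio: float = 0.3) -> torch.Tensor:
        # Zero out a random fraction of pixels to mimic partial occlusion.
        # image: (C, H, W) float tensor; mask_ratio: fraction of pixels hidden.
        _, h, w = image.shape
        keep = (torch.rand(h, w) >= mask_ratio).to(image.dtype)
        return image * keep.unsqueeze(0)  # broadcast the mask across channels

    def distillation_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
        # Pull the student's embedding of the masked crop toward the teacher's
        # embedding of the clean crop; the teacher receives no gradient here.
        # (Hypothetical objective: the paper's exact loss may differ.)
        return -F.cosine_similarity(student_emb, teacher_emb.detach(), dim=-1).mean()

Trained this way, two crops of the same person under different occlusion patterns should map to nearby embeddings, which is what the Siamese keypoint inference network relies on when associating partially occluded and unoccluded detections across frames.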



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61771299.

Author information


Contributions

Y.T. conducted the research and analysis and wrote the original draft; X.W. provided research guidance, supervision, and review editing; R.W. provided financial support and project administration. All authors have reviewed the manuscript.

Corresponding author

Correspondence to Rui Wang.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, X., Tian, Y. & Wang, R. Self-supervised Siamese keypoint inference network for human pose estimation and tracking. Machine Vision and Applications 35, 32 (2024). https://doi.org/10.1007/s00138-024-01515-5
