Autoencoder and Masked Image Encoding-Based Attentional Pose Network

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14426)

Abstract

Despite recent advances in single-image 3D human pose and shape estimation, partial occlusion remains a major challenge for many methods and leads to significant prediction errors. In particular, some existing methods perform poorly on 3D human body reconstruction in occluded outdoor environments. To address these issues, we propose an autoencoder for feature extraction that integrates image masking to improve training stability. Our approach uses an attention mechanism to capture the features of partially visible body parts, which addresses partial occlusion. We then apply a partial attention mechanism to obtain the final features and use a regressor to estimate the human model parameters. Experimental results on outdoor 3D pose benchmark datasets demonstrate that our method outperforms state-of-the-art image-based methods in robustness and efficiency. Qualitative evaluation shows that our method produces more accurate and robust reconstructions than existing methods, both in occluded scenarios and on standard benchmarks, while exhibiting good training stability.
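
As a rough illustration of the image masking idea mentioned in the abstract, the sketch below zeroes out a random subset of square tiles in an image. It is not the authors' implementation: the function name `mask_patches`, the tile size, the masking ratio, and the choice of uniformly random tiles are all assumptions made for illustration, since the abstract does not specify the masking scheme.

```python
import random


def mask_patches(image, patch=4, ratio=0.5, seed=0):
    """Zero out a random subset of non-overlapping patch x patch tiles.

    image: H x W list of rows; H and W must be multiples of `patch`.
    Returns (masked_image, masked_tiles), where masked_tiles is the set of
    (tile_row, tile_col) indices that were zeroed. The input is not modified.
    All parameter names and defaults here are hypothetical.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")

    # Enumerate every tile position on the patch grid.
    tiles = [(r, c) for r in range(h // patch) for c in range(w // patch)]

    # Pick the tiles to hide, reproducibly for a given seed.
    rng = random.Random(seed)
    n_masked = int(round(ratio * len(tiles)))
    masked = set(rng.sample(tiles, n_masked))

    # Copy the image, then overwrite the chosen tiles with zeros.
    out = [row[:] for row in image]
    for tr, tc in masked:
        for y in range(tr * patch, (tr + 1) * patch):
            for x in range(tc * patch, (tc + 1) * patch):
                out[y][x] = 0
    return out, masked


# Usage: an 8 x 8 image of ones has four 4 x 4 tiles; half of them are hidden.
image = [[1] * 8 for _ in range(8)]
masked_image, masked_tiles = mask_patches(image, patch=4, ratio=0.5, seed=0)
```

In a masked-encoding setup, an encoder would typically receive `masked_image` and be trained so that its features remain informative about the hidden tiles, which is one plausible way such masking could encourage robustness to occlusion.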

Acknowledgment

This work was supported in part by the Shenzhen Technology Project (JCYJ20220531095810023), the National Natural Science Foundation of China (61976143, U21A20487), and the Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems (2019B121205007).

Author information

Corresponding author

Correspondence to Lei Wang.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Hu, L., Ma, X., He, C., Wang, L., Cheng, J. (2024). Autoencoder and Masked Image Encoding-Based Attentional Pose Network. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14426. Springer, Singapore. https://doi.org/10.1007/978-981-99-8432-9_18

  • DOI: https://doi.org/10.1007/978-981-99-8432-9_18

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8431-2

  • Online ISBN: 978-981-99-8432-9

  • eBook Packages: Computer Science, Computer Science (R0)
