Abstract
Despite recent advances in 3D human pose and shape estimation from a single image, partial occlusion remains a major challenge for many methods and leads to significant prediction errors. Some existing methods perform poorly at 3D human body reconstruction in occluded outdoor environments. To address these issues, we propose an autoencoder for feature extraction that integrates image masking to improve training stability. Our approach uses an attention mechanism to capture the features of partially visible body parts, which addresses partial occlusion. We further employ a partial attention mechanism to obtain the final features and use a regressor to estimate the human model parameters. Experimental results on benchmark datasets of outdoor 3D poses demonstrate that our method outperforms state-of-the-art image-based methods in robustness and efficiency. Qualitative evaluation shows that our method achieves more accurate and robust reconstructions than existing methods, not only in occluded scenarios but also on standard benchmarks. Overall, our approach exhibits excellent model robustness and training stability.
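The abstract does not say how the image masking is carried out. As a minimal sketch, assuming random masking of non-overlapping square patches (a common choice in masked image encoding, not necessarily the authors' scheme), the idea can be illustrated in plain Python. The function names, the patch size, and the mask ratio below are hypothetical and chosen only for illustration.

```python
import random


def sample_patch_mask(rows, cols, mask_ratio=0.5, seed=0):
    """Return a rows x cols grid of booleans; True marks a hidden patch.

    Hypothetical helper: mask_ratio is an assumed value, not from the paper.
    """
    num_patches = rows * cols
    num_masked = round(mask_ratio * num_patches)
    rng = random.Random(seed)
    # Sample patch indices without replacement so exactly num_masked are hidden.
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    return [[(r * cols + c) in masked_ids for c in range(cols)]
            for r in range(rows)]


def apply_patch_mask(image, patch_mask, patch_size, fill=0.0):
    """Replace every pixel inside a hidden patch with `fill`.

    `image` is a grayscale image stored as a list of rows; the input is
    left unmodified and a new image is returned.
    """
    return [
        [fill if patch_mask[y // patch_size][x // patch_size] else value
         for x, value in enumerate(row)]
        for y, row in enumerate(image)
    ]


if __name__ == "__main__":
    # An 8x8 image split into a 2x2 grid of 4x4 patches, half of them hidden.
    image = [[1.0] * 8 for _ in range(8)]
    mask = sample_patch_mask(rows=2, cols=2, mask_ratio=0.5)
    masked_image = apply_patch_mask(image, mask, patch_size=4)
    for row in masked_image:
        print(" ".join(str(int(v)) for v in row))
```

In a training setup of this kind, an encoder would typically see only the masked image, so that it has to produce useful features from partially visible content, which is similar in spirit to the occlusion case the abstract targets.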
Acknowledgment
This work was supported in part by the Shenzhen Technology Project (JCYJ20220531095810023), the National Natural Science Foundation of China (61976143, U21A20487), and the Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems (2019B121205007).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Hu, L., Ma, X., He, C., Wang, L., Cheng, J. (2024). Autoencoder and Masked Image Encoding-Based Attentional Pose Network. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14426. Springer, Singapore. https://doi.org/10.1007/978-981-99-8432-9_18
DOI: https://doi.org/10.1007/978-981-99-8432-9_18
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8431-2
Online ISBN: 978-981-99-8432-9
eBook Packages: Computer Science, Computer Science (R0)