Autoencoder and Masked Image Encoding-Based Attentional Pose Network

Hu, Longhua; Ma, Xiaoliang; He, Cheng; Wang, Lei; Cheng, Jun

doi:10.1007/978-981-99-8432-9_18

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14426)

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Abstract

Despite recent advances in single-image 3D human pose and shape estimation, partial occlusion remains a major challenge for many methods and leads to significant prediction errors. Some existing methods fail to provide satisfactory 3D human body reconstruction in occluded outdoor environments. To address these issues, we propose an autoencoder for feature extraction that integrates image masking to improve training stability. Our approach uses an attention mechanism to capture the features of partially visible body parts, which addresses partial occlusion. We further employ a partial attention mechanism to obtain the final features and use a regressor to estimate the human model parameters. Experimental results on benchmark datasets of outdoor 3D poses demonstrate that our method outperforms state-of-the-art image-based methods in robustness and efficiency. Qualitative evaluation shows that our method achieves more accurate and robust reconstructions than existing methods, not only in occluded scenarios but also on standard benchmarks. Overall, the approach exhibits strong model robustness and training stability.
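
The two ideas named in the abstract, image masking and attention over body parts, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the patch size, the masking ratio, and the choice of softmax-weighted spatial pooling as the form of part attention are all assumptions made for illustration, and the example uses plain Python lists rather than a deep learning framework.

```python
import math
import random


def mask_patches(image, patch, ratio, seed=0):
    """Zero out a random subset of non-overlapping patch x patch blocks.

    `image` is a 2D list of floats. `patch`, `ratio`, and `seed` are
    hypothetical parameters chosen for this sketch, not values from the paper.
    Returns the masked copy and the top-left corners of the hidden blocks.
    """
    h, w = len(image), len(image[0])
    rng = random.Random(seed)
    blocks = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    hidden = rng.sample(blocks, int(round(ratio * len(blocks))))
    out = [row[:] for row in image]  # copy so the input is not modified
    for r0, c0 in hidden:
        for r in range(r0, min(r0 + patch, h)):
            for c in range(c0, min(c0 + patch, w)):
                out[r][c] = 0.0
    return out, hidden


def part_attention_pool(features, part_logits):
    """Pool features with one spatial softmax per body part (assumed form).

    `features` holds C feature maps and `part_logits` holds P attention maps,
    each flattened to N spatial positions. Returns a P x C list where row p is
    the attention-weighted average of every feature map for part p.
    """
    pooled = []
    for logits in part_logits:
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        pooled.append(
            [sum(wt * f for wt, f in zip(weights, fmap)) for fmap in features]
        )
    return pooled


if __name__ == "__main__":
    img = [[1.0] * 4 for _ in range(4)]
    masked, hidden = mask_patches(img, patch=2, ratio=0.5)
    print(len(hidden), sum(sum(row) for row in masked))

    feats = [[1.0, 2.0, 3.0, 4.0]]            # one feature channel, 4 positions
    logits = [[0.0] * 4, [50.0, 0.0, 0.0, 0.0]]  # two hypothetical parts
    print(part_attention_pool(feats, logits))
```

In this sketch a part whose attention map is flat averages the features over all positions, while a part with a sharply peaked map reads the features almost entirely from one position. That is the property that would let a regressor rely on visible regions when others are occluded.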

Acknowledgment

This work was supported in part by the Shenzhen Technology Project (JCYJ20220531095810023), the National Natural Science Foundation of China (61976143, U21A20487), and the Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems (2019B121205007).

Author information

Authors and Affiliations

Shenzhen University, Shenzhen, 518060, China
Longhua Hu, Xiaoliang Ma & Cheng He
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Beijing, China
Lei Wang & Jun Cheng

Corresponding author

Correspondence to Lei Wang.

Editor information

Editors and Affiliations

Nanjing University of Information Science and Technology, Nanjing, China
Qingshan Liu
Xiamen University, Xiamen, China
Hanzi Wang
Beijing University of Posts and Telecommunications, Beijing, China
Zhanyu Ma
Sun Yat-sen University, Guangzhou, China
Weishi Zheng
Peking University, Beijing, China
Hongbin Zha
Chinese Academy of Sciences, Beijing, China
Xilin Chen
Chinese Academy of Sciences, Beijing, China
Liang Wang
Xiamen University, Xiamen, China
Rongrong Ji

About this paper

Cite this paper

Hu, L., Ma, X., He, C., Wang, L., Cheng, J. (2024). Autoencoder and Masked Image Encoding-Based Attentional Pose Network. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14426. Springer, Singapore. https://doi.org/10.1007/978-981-99-8432-9_18

DOI: https://doi.org/10.1007/978-981-99-8432-9_18
Published: 24 December 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8431-2
Online ISBN: 978-981-99-8432-9
eBook Packages: Computer Science, Computer Science (R0)
