Cross-View Self-fusion for Self-supervised 3D Human Pose Estimation in the Wild

Kim, Hyun-Woo; Lee, Gun-Hee; Oh, Myeong-Seok; Lee, Seong-Whan

doi:10.1007/978-3-031-26319-4_12

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13841)

Included in the following conference series:

Asian Conference on Computer Vision

Abstract

Human pose estimation methods have recently shown remarkable results with supervised learning that requires large amounts of labeled training data. However, such training data for various human activities does not exist since 3D annotations are acquired with traditional motion capture systems that usually require a controlled indoor environment. To address this issue, we propose a self-supervised approach that learns a monocular 3D human pose estimator from unlabeled multi-view images by using multi-view consistency constraints. Furthermore, we refine inaccurate 2D poses, which adversely affect 3D pose predictions, using the property of canonical space without relying on camera calibration. Since we do not require camera calibrations to leverage the multi-view information, we can train a network from in-the-wild environments. The key idea is to fuse the 2D observations across views and combine predictions from the observations to satisfy the multi-view consistency during training. We outperform state-of-the-art methods in self-supervised learning on the two benchmark datasets Human3.6M and MPI-INF-3DHP as well as on the in-the-wild dataset SkiPose. Code and models are available at https://github.com/anonyAcc/CVSF_for_3DHPE.
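The multi-view consistency constraint mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their code is at the repository linked above), and the abstract does not specify the loss. The sketch assumes each view yields a 3D pose in a shared canonical space plus a per-view rotation, that projection is orthographic, and that 2D poses are compared after centering and scale normalization. All function names (`rotation_about_y`, `reproject`, `normalize_2d`, `cross_view_consistency_loss`) are hypothetical.

```python
import numpy as np


def rotation_about_y(angle_rad):
    """Return a 3x3 rotation about the y axis; stands in for a per-view rotation."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])


def reproject(canonical_pose, rotation):
    """Rotate a canonical pose (J, 3) into a view and drop depth (orthographic assumption)."""
    return (canonical_pose @ rotation.T)[:, :2]


def normalize_2d(pose_2d):
    """Center a 2D pose (J, 2) and scale it to unit Frobenius norm (assumed normalization)."""
    centered = pose_2d - pose_2d.mean(axis=0, keepdims=True)
    return centered / np.linalg.norm(centered)


def cross_view_consistency_loss(canonical_poses, rotations, poses_2d):
    """Mean L1 error when the pose predicted from view a is reprojected with the rotation of view b."""
    n_views = len(poses_2d)
    total = 0.0
    for a in range(n_views):
        for b in range(n_views):
            projected = normalize_2d(reproject(canonical_poses[a], rotations[b]))
            target = normalize_2d(poses_2d[b])
            total += np.abs(projected - target).mean()
    return total / (n_views * n_views)
```

Under these assumptions the loss is zero when every view predicts the same canonical pose and that pose explains the 2D observations in all views, and it grows when one view's prediction disagrees with the others. No camera calibration enters the computation, only the predicted rotations.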

Acknowledgements

This work was partially supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) funded by the Korea government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program (Korea University)) and the Technology Innovation Program (No. 20017012, Business Model Development for Golf Putting Simulator using AI Video Analysis and Coaching Service at Home) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).

Author information

Authors and Affiliations

Department of Artificial Intelligence, Korea University, Seoul, Korea
Hyun-Woo Kim & Seong-Whan Lee
Department of Computer Science and Engineering, Korea University, Seoul, Korea
Gun-Hee Lee, Myeong-Seok Oh & Seong-Whan Lee

Corresponding author

Correspondence to Seong-Whan Lee.

Editor information

Editors and Affiliations

University of Wollongong, Wollongong, NSW, Australia
Lei Wang
University of Bonn, Bonn, Germany
Juergen Gall
University of Adelaide, Adelaide, SA, Australia
Tat-Jun Chin
National Institute of Informatics, Tokyo, Japan
Imari Sato
Johns Hopkins University, Baltimore, MD, USA
Rama Chellappa

About this paper

Cite this paper

Kim, HW., Lee, GH., Oh, MS., Lee, SW. (2023). Cross-View Self-fusion for Self-supervised 3D Human Pose Estimation in the Wild. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13841. Springer, Cham. https://doi.org/10.1007/978-3-031-26319-4_12

DOI: https://doi.org/10.1007/978-3-031-26319-4_12
Published: 04 March 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26318-7
Online ISBN: 978-3-031-26319-4
eBook Packages: Computer Science, Computer Science (R0)
