ABSTRACT
3D mesh reconstruction of the human body is important for human action analysis. Most methods require large amounts of training data with human-annotated 3D labels. Such data are costly to collect, and existing annotated 3D datasets generally do not cover complex actions in complex environments. Recent work has therefore begun to study 3D pseudo-label methods. To obtain sufficient constraints, most pseudo-label human mesh recovery algorithms rely heavily on 3D pseudo-labels produced by existing unsupervised 3D human pose estimation algorithms. Unfortunately, unsupervised approaches cannot guarantee accurate 3D pose estimates, and inaccurate 3D pseudo-labels harm model training. To solve this problem, we propose an end-to-end 3D-label-free training framework that uses multi-view consistency, rather than human-annotated 3D labels or 3D pseudo-labels, to provide sufficient constraints. Multi-view consistency exploits attributes of the human body that remain consistent across multi-view images to provide self-supervised constraints. We evaluate our method on two benchmark datasets (Human3.6M and MPI-INF-3DHP), where it achieves competitive results.
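As a rough illustration of how agreement between views can act as a self-supervised constraint, the sketch below penalizes disagreement between 3D joints predicted independently from two camera views. It is not the framework proposed here: the abstract does not specify the loss, so the function name `multiview_consistency_loss`, the use of root-relative joints, and the assumption of a known relative rotation `rot_a_to_b` between cameras are all hypothetical choices made for the example.

```python
import numpy as np


def multiview_consistency_loss(joints_a, joints_b, rot_a_to_b):
    """Mean squared distance between the joints of two views in a common frame.

    joints_a, joints_b: (J, 3) root-relative 3D joints predicted from views A and B.
    rot_a_to_b: (3, 3) rotation mapping view-A camera coordinates to view-B
                (assumed known here; a hypothetical setup, not from the paper).
    """
    # Rotate the view-A prediction into the view-B camera frame.
    joints_a_in_b = joints_a @ rot_a_to_b.T
    # Average the squared Euclidean distance over all joints.
    return float(np.mean(np.sum((joints_a_in_b - joints_b) ** 2, axis=1)))


# Toy check: view B is view A rotated by 90 degrees about the z-axis.
theta = np.pi / 2
R = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
rng = np.random.default_rng(0)
joints_a = rng.normal(size=(17, 3))  # 17 joints, an arbitrary choice
joints_b = joints_a @ R.T            # predictions that agree across views

consistent = multiview_consistency_loss(joints_a, joints_b, R)
inconsistent = multiview_consistency_loss(joints_a, joints_b + 0.1, R)
```

Predictions that agree across views give a loss of zero, while a constant offset of 0.1 on every coordinate raises it to about 0.03. No 3D annotation enters the computation, which is the sense in which such a term is label-free.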
Index Terms
- 3D-Label-Free Human Mesh Recovery Using Multi-view Consistency