Abstract
In various multimedia applications, reconstructing 3D meshes of hands and objects from a single RGB image is of great significance. Mesh-based methods mainly resort to mesh displacements and estimate the relative positions between hands and objects, but the estimated distances can be inaccurate. Methods based on signed distance functions (SDFs) learn relative positions by concurrently sampling hand and object meshes; unfortunately, these methods have very limited capability of reconstructing smooth surfaces with rich details. For example, SDF-based methods are inclined to lose the correct topology. To the best of our knowledge, only a few works can simultaneously reconstruct hands and objects with smooth surfaces and accurate relative positions. To this end, we present HandO, a novel hybrid hand–object model that enables hand–object 3D reconstruction with smooth surfaces and accurate positions. Critically, our model for the first time adopts a hybrid 3D representation for this task by bringing meshes, SDFs, and parametric models together. A feature extractor extracts image features, and each SDF sample point is projected onto the feature map to obtain its local features. Essentially, our model can be naturally extended to reconstruct a whole body holding an object via the new hybrid representation. Additionally, to overcome the lack of training data, we contribute a synthetic body-holding dataset to the community, facilitating research on reconstructing hands and objects. It contains 31,763 images spanning over 50 object categories. Extensive experiments demonstrate that our model outperforms the competitors on benchmark datasets.
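To make the pixel-aligned sampling step above concrete, the following is a minimal PyTorch sketch of how SDF sample points can be projected onto an image feature map to gather per-point local features. The function name, the pinhole camera model, and all tensor shapes are illustrative assumptions, not the paper's exact implementation:

import torch
import torch.nn.functional as F

def sample_local_features(feat_map, points, K):
    # feat_map: (B, C, H, W) image features from a CNN backbone (assumed shapes).
    # points:   (B, N, 3) 3D SDF sample points in camera coordinates.
    # K:        (B, 3, 3) pinhole camera intrinsics (assumed camera model).
    B, C, H, W = feat_map.shape
    # Perspective projection: map each 3D point to homogeneous pixel coordinates.
    proj = torch.bmm(points, K.transpose(1, 2))            # (B, N, 3)
    uv = proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)    # (B, N, 2) pixel coords
    # Normalize pixel coordinates to [-1, 1], as required by grid_sample.
    u = 2.0 * uv[..., 0] / (W - 1) - 1.0
    v = 2.0 * uv[..., 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).unsqueeze(2)        # (B, N, 1, 2)
    # Bilinear interpolation gathers one local feature vector per sample point.
    local = F.grid_sample(feat_map, grid, align_corners=True)  # (B, C, N, 1)
    return local.squeeze(-1).transpose(1, 2)               # (B, N, C)

An SDF head can then consume these per-point local features (typically concatenated with the point coordinates) to regress a signed distance per point, from which a mesh can be extracted, e.g., via Marching Cubes.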

Notes
The dataset is available at https://baboon527.github.io/HandO/.
Acknowledgement
This research was jointly funded by the Shanghai Municipal Science and Technology Major Project (No. 2021SHZDZX0103) and the Shanghai Research and Innovation Functional Program (No. 17DZ2260900).
Additional information
Communicated by Yongdong Zhang.
Cite this article
Yu, H., Cheang, C., Fu, Y. et al. HandO: a hybrid 3D hand–object reconstruction model for unknown objects. Multimedia Systems 28, 1845–1859 (2022). https://doi.org/10.1007/s00530-021-00874-7