
S²Contact: Graph-Based Network for 3D Hand-Object Contact Estimation with Semi-supervised Learning

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13661)


Abstract

Despite recent efforts towards accurate 3D annotations in hand and object datasets, gaps remain in 3D hand and object reconstructions. Existing works leverage contact maps to refine inaccurate hand-object pose estimations and to generate grasps given object models. However, they require explicit 3D supervision, which is seldom available; they are therefore limited to constrained settings, e.g., where thermal cameras observe residual heat left on manipulated objects. In this paper, we propose a novel semi-supervised framework that allows us to learn contact from monocular images. Specifically, we leverage visual and geometric consistency constraints in large-scale datasets to generate pseudo-labels for semi-supervised learning, and propose an efficient graph-based network to infer contact. Our semi-supervised learning framework achieves a favourable improvement over existing supervised learning methods trained on data with limited annotations. Notably, our proposed model achieves superior results with less than half the network parameters and memory access cost of the commonly used PointNet-based approach. We show the benefits of using a contact map that governs hand-object interactions to produce more accurate reconstructions. We further demonstrate that training with pseudo-labels can extend contact map estimation to out-of-domain objects and generalise better across multiple datasets. The project page is available at https://eldentse.github.io/s2contact/.
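To make the abstract's "graph-based network to infer contact" concrete, here is a minimal illustrative sketch (not the paper's actual architecture): a k-nearest-neighbour graph is built over the joint hand-object point cloud, and an EdgeConv-style graph convolution with randomly initialised weights produces a per-point contact probability. All function names, layer sizes, and the neighbourhood size `k` are assumptions for illustration only.

```python
import numpy as np

def knn_graph(points, k=4):
    """Indices of the k nearest neighbours of each 3D point (excluding self)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                   # a point is not its own neighbour
    return np.argsort(d, axis=1)[:, :k]           # (N, k)

def graph_conv(features, neighbours, weight):
    """One EdgeConv-style layer: concatenate each point's feature with the
    max over its neighbour-difference features, then project with ReLU."""
    diffs = features[neighbours] - features[:, None, :]           # (N, k, F)
    agg = np.concatenate([features, diffs.max(axis=1)], axis=-1)  # (N, 2F)
    return np.maximum(agg @ weight, 0.0)                          # (N, H)

def contact_scores(hand_pts, obj_pts, rng):
    """Per-point contact probability over the joint hand-object cloud
    (untrained random weights; for shape/flow illustration only)."""
    pts = np.concatenate([hand_pts, obj_pts], axis=0)  # (N, 3)
    nbrs = knn_graph(pts, k=4)
    w1 = rng.standard_normal((6, 16)) * 0.1            # 2F=6 -> hidden 16
    h = graph_conv(pts, nbrs, w1)
    w2 = rng.standard_normal((16, 1)) * 0.1            # hidden -> 1 logit
    logits = h @ w2
    return (1.0 / (1.0 + np.exp(-logits))).ravel()     # sigmoid, (N,)
```

A real model would stack several such layers and train the weights with a contact loss; this sketch only shows how per-point scores arise from local graph aggregation rather than from a global PointNet feature.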

T. H. E. Tse and Z. Zhang contributed equally.
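The abstract also mentions generating pseudo-labels from visual and geometric consistency constraints. As a hedged sketch of the general idea (not the paper's actual criterion), one geometric cue is hand-object proximity: a predicted per-point contact label is kept as a pseudo-label only where it agrees with that cue, and disagreeing points are marked to be ignored during training. The helper names and the distance threshold are illustrative assumptions.

```python
import numpy as np

def proximity_contact(hand_pts, obj_pts, thresh=0.01):
    """Geometric cue: a hand point counts as 'in contact' if any object
    point lies within `thresh` (same units as the point coordinates)."""
    d = np.linalg.norm(hand_pts[:, None, :] - obj_pts[None, :, :], axis=-1)
    return d.min(axis=1) < thresh                      # (N_hand,) bool

def filter_pseudo_labels(pred_contact, hand_pts, obj_pts, thresh=0.01):
    """Keep a predicted per-point contact label only where it agrees with
    the geometric proximity cue; disagreements become -1 (ignored)."""
    geo = proximity_contact(hand_pts, obj_pts, thresh)
    return np.where(pred_contact == geo, pred_contact.astype(int), -1)
```

For example, with one hand point 5 mm from the object and another 1 m away, a prediction of contact on both points yields pseudo-labels `[1, -1]`: only the geometrically consistent point survives as supervision.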


References

  1. Brahmbhatt, S., Ham, C., Kemp, C.C., Hays, J.: ContactDB: analyzing and predicting grasp contact via thermal imaging. In: CVPR (2019)

    Google Scholar 

  2. Brahmbhatt, S., Tang, C., Twigg, C.D., Kemp, C.C., Hays, J.: ContactPose: a dataset of grasps with object contact and hand pose. In: ECCV (2020)

    Google Scholar 

  3. Cao, Z., Radosavovic, I., Kanazawa, A., Malik, J.: Reconstructing hand-object interactions in the wild. In: ICCV (2021)

    Google Scholar 

  4. Chao, Y.W., et al.: DexYCB: a benchmark for capturing hand grasping of objects. In: CVPR (2021)

    Google Scholar 

  5. Chen, W., Jia, X., Chang, H.J., Duan, J., Leonardis, A.: G2L-Net: global to local network for real-time 6D pose estimation with embedding vector features. In: CVPR (2020)

    Google Scholar 

  6. Chen, W., Jia, X., Chang, H.J., Duan, J., Shen, L., Leonardis, A.: FS-Net: fast shape-based network for category-level 6D object pose estimation with decoupled rotation mechanism. In: CVPR (2021)

    Google Scholar 

  7. Chen, Y., Tu, Z., Ge, L., Zhang, D., Chen, R., Yuan, J.: SO-HandNet: self-organizing network for 3D hand pose estimation with semi-supervised learning. In: CVPR (2019)

    Google Scholar 

  8. Corona, E., Pumarola, A., Alenya, G., Moreno-Noguer, F., Rogez, G.: GanHand: predicting human grasp affordances in multi-object scenes. In: CVPR (2020)

    Google Scholar 

  9. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: NeurIPS (2016)

    Google Scholar 

  10. Doosti, B., Naha, S., Mirbagheri, M., Crandall, D.J.: HOPE-Net: a graph-based model for hand-object pose estimation. In: CVPR (2020)

    Google Scholar 

  11. Garcia-Hernando, G., Yuan, S., Baek, S., Kim, T.K.: First-person hand action benchmark with RGB-D videos and 3D hand pose annotations. In: CVPR (2018)

    Google Scholar 

  12. Grady, P., Tang, C., Twigg, C.D., Vo, M., Brahmbhatt, S., Kemp, C.C.: ContactOpt: optimizing contact to improve grasps. In: CVPR (2021)

    Google Scholar 

  13. Guo, M.-H., Cai, J.-X., Liu, Z.-N., Mu, T.-J., Martin, R.R., Hu, S.-M.: PCT: point cloud transformer. Computational Visual Media 7(2), 187–199 (2021). https://doi.org/10.1007/s41095-021-0229-5

    Article  Google Scholar 

  14. Hampali, S., Rad, M., Oberweger, M., Lepetit, V.: Honnotate: A method for 3D annotation of hand and object poses. In: CVPR (2020)

    Google Scholar 

  15. Han, S., et al.: MEgATrack: monochrome egocentric articulated hand-tracking for virtual reality. In: SIGGRAPH (2020)

    Google Scholar 

  16. Hasson, Y., Tekin, B., Bogo, F., Laptev, I., Pollefeys, M., Schmid, C.: Leveraging photometric consistency over time for sparsely supervised hand-object reconstruction. In: CVPR (2020)

    Google Scholar 

  17. Hasson, Y., Varol, G., Laptev, I., Schmid, C.: Towards unconstrained joint hand-object reconstruction from RGB videos. In: 3DV (2021)

    Google Scholar 

  18. Hasson, Y., et al.: Learning joint reconstruction of hands and manipulated objects. In: CVPR (2019)

    Google Scholar 

  19. Huang, L., Tan, J., Meng, J., Liu, J., Yuan, J.: HOT-Net: non-autoregressive transformer for 3D hand-object pose estimation. In: ACM MM (2020)

    Google Scholar 

  20. Jiang, H., Liu, S., Wang, J., Wang, X.: Hand-object contact consistency reasoning for human grasps generation. In: ICCV (2021)

    Google Scholar 

  21. Karunratanakul, K., Yang, J., Zhang, Y., Black, M.J., Muandet, K., Tang, S.: Grasping field: learning implicit representations for human grasps. In: 3DV (2020)

    Google Scholar 

  22. Kato, H., Ushiku, Y., Harada, T.: Neural 3D mesh renderer. In: CVPR (2018)

    Google Scholar 

  23. Kaviani, S., Rahimi, A., Hartley, R.: Semi-Supervised 3D hand shape and pose estimation with label propagation. arXiv preprint arXiv:2111.15199 (2021)

  24. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)

    Google Scholar 

  25. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: ICLR (2017)

    Google Scholar 

  26. Kwon, T., Tekin, B., Stühmer, J., Bogo, F., Pollefeys, M.: H2O: two hands manipulating objects for first person interaction recognition. In: ICCV (2021)

    Google Scholar 

  27. Labbé, Y., Carpentier, J., Aubry, M., Sivic, J.: CosyPose: consistent multi-view multi-object 6D pose estimation. In: ECCV (2020)

    Google Scholar 

  28. Li, G., Muller, M., Thabet, A., Ghanem, B.: DeepGNSs: can GCNs go as deep as CNNs? In: ICCV (2019)

    Google Scholar 

  29. Li, Y., Wang, G., Ji, X., Xiang, Y., Fox, D.: DeepIM: deep iterative matching for 6D pose estimation. In: ECCV (2018)

    Google Scholar 

  30. Lin, Z.H., Huang, S.Y., Wang, Y.C.F.: Convolution in the cloud: learning deformable kernels in 3D graph convolution networks for point cloud analysis. In: CVPR (2020)

    Google Scholar 

  31. Liu, S., Jiang, H., Xu, J., Liu, S., Wang, X.: Semi-supervised 3D hand-object poses estimation with interactions in time. In: CVPR (2021)

    Google Scholar 

  32. Liu, Z., Hu, H., Cao, Y., Zhang, Z., Tong, X.: A closer look at local aggregation operators in point cloud analysis. In: ECCV (2020)

    Google Scholar 

  33. Maturana, D., Scherer, S.: VoxNet: a 3D convolutional neural network for real-time object recognition. In: IROS (2015)

    Google Scholar 

  34. Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., Bronstein, M.M.: Geometric deep learning on graphs and manifolds using mixture model CNNs. In: CVPR (2017)

    Google Scholar 

  35. Mueller, F., et al.: GANerated hands for real-time 3D hand tracking from monocular RGB. In: CVPR (2018)

    Google Scholar 

  36. Mueller, F., et al.: Real-time pose and shape reconstruction of two interacting hands with a single depth camera. In: SIGGRAPH (2019)

    Google Scholar 

  37. Paszke, A., et al.: Automatic Differentiation in Pytorch. In: NeurIPS (2017)

    Google Scholar 

  38. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: CVPR (2017)

    Google Scholar 

  39. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning on point sets in a metric space. In: NeurIPS (2017)

    Google Scholar 

  40. Qian, G., Hammoud, H., Li, G., Thabet, A., Ghanem, B.: ASSANet: an anisotropic separable set abstraction for efficient point cloud representation learning. NeurIPS (2021)

    Google Scholar 

  41. Romero, J., Tzionas, D., Black, M.J.: Embodied hands: modeling and capturing hands and bodies together. ACM Trans. Graph. (ToG) 36(6), 1–17 (2017)

    Google Scholar 

  42. Simon, T., Joo, H., Matthews, I., Sheikh, Y.: Hand keypoint detection in single images using multiview bootstrapping. In: CVPR (2017)

    Google Scholar 

  43. Spurr, A., Molchanov, P., Iqbal, U., Kautz, J., Hilliges, O.: Adversarial motion modelling helps semi-supervised hand pose estimation. arXiv preprint arXiv:2106.05954 (2021)

  44. Spurr, A., Song, J., Park, S., Hilliges, O.: Cross-modal deep variational hand pose estimation. In: CVPR (2018)

    Google Scholar 

  45. Szegedy, C., et al.: Going deeper with convolutions. In: CVPR (2015)

    Google Scholar 

  46. Taheri, O., Ghorbani, N., Black, M.J., Tzionas, D.: GRAB: a dataset of whole-body human grasping of objects. In: ECCV (2020)

    Google Scholar 

  47. Tang, D., Chang, H.J., Tejani, A., Kim, T.K.: Latent regression forest: structured estimation of 3D articulated hand posture. In: CVPR (2014)

    Google Scholar 

  48. Tang, D., Yu, T.H., Kim, T.K.: Real-time articulated hand pose estimation using semi-supervised transductive regression forests. In: ICCV (2013)

    Google Scholar 

  49. Tekin, B., Bogo, F., Pollefeys, M.: H+O: unified egocentric recognition of 3D hand-object poses and interactions. In: CVPR (2019)

    Google Scholar 

  50. Ueda, E., Matsumoto, Y., Imai, M., Ogasawara, T.: A hand-pose estimation for vision-based human interfaces. IEEE Trans. Ind. Electron. 50(4), 676–684 (2003)

    Google Scholar 

  51. Wang, H., Cong, Y., Litany, O., Gao, Y., Guibas, L.J.: 3DIoUMatch: leveraging IoU prediction for semi-supervised 3D object detection. In: CVPR (2021)

    Google Scholar 

  52. Wang, J., et al.: RGB2Hands: real-time tracking of 3D hand interactions from monocular RGB video. In: SIGGRAPH (2020)

    Google Scholar 

  53. Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph CNN for learning on point clouds. In: SIGGRAPH (2019)

    Google Scholar 

  54. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)

    Article  Google Scholar 

  55. Wu, W., Qi, Z., Fuxin, L.: PointConv: deep convolutional networks on 3D point clouds. In: CVPR (2019)

    Google Scholar 

  56. Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: a convolutional neural network for 6D object pose estimation in cluttered scenes. In: RSS (2018)

    Google Scholar 

  57. Xu, M., Ding, R., Zhao, H., Qi, X.: PAConv: position adaptive convolution with dynamic kernel assembling on point clouds. In: CVPR (2021)

    Google Scholar 

  58. Yang, J., Chang, H.J., Lee, S., Kwak, N.: SeqHAND: RGB-sequence-based 3D hand pose and shape estimation. In: ECCV (2020)

    Google Scholar 

  59. Yang, L., Chen, S., Yao, A.: SemiHand: semi-supervised hand pose estimation with consistency. In: ICCV (2021)

    Google Scholar 

  60. Yang, L., Zhan, X., Li, K., Xu, W., Li, J., Lu, C.: CPF: learning a contact potential field to model the hand-object interaction. In: ICCV (2021)

    Google Scholar 

  61. You, H., Feng, Y., Ji, R., Gao, Y.: PVNet: a joint convolutional network of point cloud and multi-view for 3D shape recognition. In: ACM Multimedia (2018)

    Google Scholar 

  62. Zhang, T., et al.: Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In: ICRA (2018)

    Google Scholar 

  63. Zhao, H., Jiang, L., Jia, J., Torr, P.H., Koltun, V.: Point transformer. In: ICCV (2021)

    Google Scholar 

  64. Zimmermann, C., Brox, T.: Learning to estimate 3D hand pose from single RGB images. In: ICCV (2017)

    Google Scholar 

Download references

Acknowledgements

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP–2022–2020–0–01789) supervised by the IITP (Institute of Information & Communications Technology Planning & Evaluation) and the Baskerville Tier 2 HPC service (https://www.baskerville.ac.uk/) funded by the Engineering and Physical Sciences Research Council (EPSRC) and UKRI through the World Class Labs scheme (EP/T022221/1) and the Digital Research Infrastructure programme (EP/W032244/1) operated by Advanced Research Computing at the University of Birmingham. KIK was supported by the National Research Foundation of Korea (NRF) grant (No. 2021R1A2C2012195) and IITP grants (IITP–2021–0–02068 and IITP–2020–0–01336). ZQZ was supported by China Scholarship Council (CSC) Grant No. 202208060266. AL was supported in part by the EPSRC (grant number EP/S032487/1). FZ was supported by the National Natural Science Foundation of China under Grant No. 61972188 and 62122035.

Author information

Correspondence to Zhongqun Zhang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 906 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Tse, T.H.E., Zhang, Z., Kim, K.I., Leonardis, A., Zheng, F., Chang, H.J. (2022). S²Contact: Graph-Based Network for 3D Hand-Object Contact Estimation with Semi-supervised Learning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13661. Springer, Cham. https://doi.org/10.1007/978-3-031-19769-7_33


  • DOI: https://doi.org/10.1007/978-3-031-19769-7_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19768-0

  • Online ISBN: 978-3-031-19769-7

  • eBook Packages: Computer Science (R0)
