
AttentionHand: Text-Driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Recently, a significant amount of research has been conducted on 3D hand reconstruction for use in various forms of human-computer interaction. However, 3D hand reconstruction in the wild is challenging due to the extreme lack of in-the-wild 3D hand datasets. In particular, when hands are in complex poses, such as interacting hands, problems like appearance similarity, self-occlusion, and depth ambiguity make reconstruction even more difficult. To overcome these issues, we propose AttentionHand, a novel method for text-driven controllable hand image generation. Since AttentionHand can generate numerous and diverse in-the-wild hand images well aligned with 3D hand labels, we can acquire a new 3D hand dataset and relieve the domain gap between indoor and outdoor scenes. Our method takes four easy-to-use modalities: an RGB image, a hand mesh image rendered from the 3D label, a bounding box, and a text prompt. These modalities are embedded into the latent space in the encoding phase. Then, in the text attention stage, hand-related tokens from the given text prompt are attended to highlight hand-related regions of the latent embedding. The highlighted embedding is then fed to the visual attention stage, where hand-related regions in the embedding are attended by conditioning on global and local hand mesh images within the diffusion-based pipeline. In the decoding phase, the final feature is decoded into new hand images that are well aligned with the given hand mesh image and text prompt. As a result, AttentionHand achieves state-of-the-art performance among text-to-hand image generation models, and the performance of 3D hand mesh reconstruction improves when additionally trained with hand images generated by AttentionHand.
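The abstract describes a two-stage attention pipeline: a text attention stage that up-weights hand-related prompt tokens to highlight hand regions of the latent embedding, followed by a visual attention stage that conditions on global and local (bounding-box-cropped) hand mesh images inside a diffusion-based pipeline. The minimal PyTorch sketch below illustrates only this high-level idea; the module names, dimensions, the token-boosting factor, and the fusion layer are all illustrative assumptions, not the paper's actual architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TextAttentionStage(nn.Module):
        """Cross-attention that up-weights hand-related text tokens so that
        hand-related regions of the latent embedding are highlighted."""
        def __init__(self, dim=320, text_dim=768, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                              vdim=text_dim, batch_first=True)

        def forward(self, latent, text_tokens, hand_token_mask):
            # latent:          (B, N, dim) flattened latent embedding
            # text_tokens:     (B, T, text_dim) prompt embeddings (e.g. CLIP)
            # hand_token_mask: (B, T) bool, True for hand-related tokens;
            #                  assumed to come from an upstream tagging step
            boost = 1.0 + hand_token_mask.unsqueeze(-1).float()  # 2x on hand tokens
            boosted = text_tokens * boost
            out, _ = self.attn(latent, boosted, boosted)
            return latent + out  # residual connection -> highlighted embedding

    class VisualAttentionStage(nn.Module):
        """Conditions the highlighted latent on global and local hand mesh
        image features (local = crop given by the hand bounding box)."""
        def __init__(self, dim=320):
            super().__init__()
            self.fuse = nn.Linear(2 * dim, dim)

        def forward(self, latent, global_mesh_feat, local_mesh_feat):
            # mesh features: (B, N, dim), encoded from rendered mesh images
            cond = self.fuse(torch.cat([global_mesh_feat, local_mesh_feat], dim=-1))
            score = (latent * cond).sum(dim=-1, keepdim=True)  # (B, N, 1)
            weights = F.softmax(score, dim=1)                  # attend over regions
            return latent + weights * cond

    # Toy shapes only; in practice these tensors come from VAE/CLIP encoders.
    B, N, T = 2, 64, 12
    latent = torch.randn(B, N, 320)
    text = torch.randn(B, T, 768)
    mask = torch.zeros(B, T, dtype=torch.bool)
    mask[:, 3] = True  # pretend token 3 is the word "hands"
    h = TextAttentionStage()(latent, text, mask)
    out = VisualAttentionStage()(h, torch.randn(B, N, 320), torch.randn(B, N, 320))
    print(out.shape)  # torch.Size([2, 64, 320])

Note that the sketch replaces the diffusion-based conditioning described in the abstract with a single fusion layer purely to keep the example self-contained and runnable.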

J. Park and K. Kong—Equal contribution.



Acknowledgements

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2024-RS-2023-00260091) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation); by a Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0020535, The Competency Development Program for Industry Specialist); and by the National Supercomputing Center with supercomputing resources, including technical support (KSC-2023-CRE-0444).

Author information


Corresponding author

Correspondence to Suk-Ju Kang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 58477 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Park, J., Kong, K., Kang, SJ. (2025). AttentionHand: Text-Driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15118. Springer, Cham. https://doi.org/10.1007/978-3-031-73027-6_19


  • DOI: https://doi.org/10.1007/978-3-031-73027-6_19

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73026-9

  • Online ISBN: 978-3-031-73027-6

  • eBook Packages: Computer Science, Computer Science (R0)
