Abstract
Recently, there has been significant research on 3D hand reconstruction for various forms of human-computer interaction. However, 3D hand reconstruction in the wild remains challenging due to the extreme scarcity of in-the-wild 3D hand datasets. In particular, when hands are in complex poses, such as interacting hands, problems like appearance similarity, self-occlusion between hands, and depth ambiguity make reconstruction even more difficult. To overcome these issues, we propose AttentionHand, a novel method for text-driven controllable hand image generation. Since AttentionHand can generate numerous and diverse in-the-wild hand images well aligned with 3D hand labels, we can acquire a new 3D hand dataset and relieve the domain gap between indoor and outdoor scenes. Our method takes four easy-to-use modalities: an RGB image, a hand mesh image rendered from the 3D label, a bounding box, and a text prompt. These modalities are embedded into the latent space in the encoding phase. Then, in the text attention stage, hand-related tokens from the given text prompt are attended to highlight hand-related regions of the latent embedding. The highlighted embedding is fed to the visual attention stage, where hand-related regions of the embedding are attended by conditioning on global and local hand mesh images within the diffusion-based pipeline. In the decoding phase, the final feature is decoded into new hand images that are well aligned with the given hand mesh image and text prompt. As a result, AttentionHand achieves state-of-the-art performance among text-to-hand image generation models, and 3D hand mesh reconstruction improves when additionally trained with hand images generated by AttentionHand.
J. Park and K. Kong—Equal contribution.
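To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of how a text attention stage and a visual attention stage could be wired together. All module names, tensor shapes, and the residual cross-attention wiring are illustrative assumptions for exposition, not the authors' implementation; in the actual method these stages sit inside a latent diffusion pipeline rather than operating on raw encodings.

```python
# Illustrative sketch only: module names, shapes, and attention wiring are
# assumptions based on the abstract, not the authors' released code.
import torch
import torch.nn as nn


class TextAttention(nn.Module):
    """Highlight hand-related latent regions with hand-related text tokens
    (assumed design: residual cross-attention)."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, latent_tokens, hand_text_tokens):
        # Latent tokens query the hand-related text tokens so that
        # hand regions of the embedding are emphasized.
        out, _ = self.attn(latent_tokens, hand_text_tokens, hand_text_tokens)
        return latent_tokens + out


class VisualAttention(nn.Module):
    """Condition on global and local (bounding-box crop) hand mesh encodings
    (assumed design: concatenate both as keys/values)."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, latent_tokens, global_mesh_tokens, local_mesh_tokens):
        cond = torch.cat([global_mesh_tokens, local_mesh_tokens], dim=1)
        out, _ = self.attn(latent_tokens, cond, cond)
        return latent_tokens + out


# Toy shapes: batch 1, 64 latent tokens, 16 text tokens, feature dim 128.
dim = 128
latent = torch.randn(1, 64, dim)   # encoded RGB image (encoding phase)
text = torch.randn(1, 16, dim)     # hand-related tokens from the prompt
g_mesh = torch.randn(1, 64, dim)   # encoded global hand mesh image
l_mesh = torch.randn(1, 64, dim)   # encoded local crop from the bounding box

latent = TextAttention(dim)(latent, text)              # text attention stage
latent = VisualAttention(dim)(latent, g_mesh, l_mesh)  # visual attention stage
# A diffusion U-Net would denoise `latent` before decoding back to pixels.
print(latent.shape)  # torch.Size([1, 64, 128])
```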
Acknowledgements
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2024-RS-2023-00260091) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), by a Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0020535, The Competency Development Program for Industry Specialist), and by the National Supercomputing Center with supercomputing resources including technical support (KSC-2023-CRE-0444).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Park, J., Kong, K., Kang, S.J. (2025). AttentionHand: Text-Driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15118. Springer, Cham. https://doi.org/10.1007/978-3-031-73027-6_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73026-9
Online ISBN: 978-3-031-73027-6