
PoseEmbroider: Towards a 3D, Visual, Semantic-Aware Human Pose Representation

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15129)


Abstract

Aligning multiple modalities in a latent space, such as images and texts, has been shown to produce powerful semantic visual representations, fueling tasks like image captioning, text-to-image generation, or image grounding. In the context of human-centric vision, although CLIP-like representations encode most standard human poses relatively well (such as standing or sitting), they lack the acuity to discern detailed or uncommon ones. Indeed, while 3D human poses have often been associated with images (e.g. to perform pose estimation or pose-conditioned image generation), or more recently with text (e.g. for text-to-pose generation), they have seldom been paired with both. In this work, we combine 3D poses, pictures of people and textual pose descriptions to produce an enhanced 3D-, visual- and semantic-aware human pose representation. We introduce a new transformer-based model, trained in a retrieval fashion, which can take as input any combination of the aforementioned modalities. When composing modalities, it outperforms a standard multi-modal alignment retrieval model, making it possible to handle partial information (e.g. an image with the lower body occluded). We showcase the potential of such an embroidered pose representation for (1) SMPL regression from an image with an optional text cue; and (2) the task of fine-grained instruction generation, which consists of generating a text that describes how to move from one 3D pose to another (as a fitness coach would). Unlike prior works, our model can take any kind of input (image and/or pose) without retraining.
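To make the retrieval-style training concrete, below is a minimal PyTorch sketch in the spirit of the abstract: features from any subset of {3D pose, image, text} encoders are projected to a common width, fused by a small transformer around a learnable query token, and aligned across modality subsets with a symmetric InfoNCE loss. All names, dimensions and design choices here (PoseEmbroiderSketch, the single query token, pose_dim=63, etc.) are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PoseEmbroiderSketch(nn.Module):
        """Hypothetical tri-modal fuser: any subset of {pose, image, text}
        features is projected to a shared width, prepended with a learnable
        query token, and fused by a transformer encoder."""

        def __init__(self, d_model=256, n_heads=4, n_layers=2,
                     pose_dim=63, image_dim=512, text_dim=512):
            super().__init__()
            # One projection per modality; upstream encoders are assumed
            # to output fixed-size global features of the given sizes.
            self.proj = nn.ModuleDict({
                "pose": nn.Linear(pose_dim, d_model),
                "image": nn.Linear(image_dim, d_model),
                "text": nn.Linear(text_dim, d_model),
            })
            self.query = nn.Parameter(torch.zeros(1, 1, d_model))
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.fuser = nn.TransformerEncoder(layer, n_layers)

        def forward(self, **feats):
            # feats maps a modality name to a (batch, dim) tensor;
            # absent modalities are simply not passed in.
            tokens = [self.proj[m](x).unsqueeze(1) for m, x in feats.items()]
            batch = tokens[0].shape[0]
            seq = torch.cat([self.query.expand(batch, -1, -1)] + tokens, dim=1)
            fused = self.fuser(seq)
            # The query token acts as the global multi-modal embedding.
            return F.normalize(fused[:, 0], dim=-1)

    def info_nce(a, b, temperature=0.07):
        """Symmetric InfoNCE loss aligning two batches of unit embeddings."""
        logits = a @ b.t() / temperature
        targets = torch.arange(a.shape[0], device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Toy usage: align the (image + text) embedding with the pose embedding.
    model = PoseEmbroiderSketch()
    pose_f = torch.randn(8, 63)
    image_f, text_f = torch.randn(8, 512), torch.randn(8, 512)
    loss = info_nce(model(image=image_f, text=text_f), model(pose=pose_f))
    loss.backward()

In this sketch, dropping a modality just means omitting its keyword argument, which mirrors the paper's claim that a single model can handle any input combination without retraining.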



Acknowledgments

This work is supported by the Spanish government under project MoHuCo PID2020-120049RB-I00, and by NAVER LABS Europe under technology transfer contract ‘Text4Pose’.

Author information

Correspondence to Ginger Delmas.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 5088 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Delmas, G., Weinzaepfel, P., Moreno-Noguer, F., Rogez, G. (2025). PoseEmbroider: Towards a 3D, Visual, Semantic-Aware Human Pose Representation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15129. Springer, Cham. https://doi.org/10.1007/978-3-031-73209-6_4

  • DOI: https://doi.org/10.1007/978-3-031-73209-6_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73208-9

  • Online ISBN: 978-3-031-73209-6

  • eBook Packages: Computer Science, Computer Science (R0)
