Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

We introduce Nymeria - a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices. The dataset comes with a) full-body ground-truth motion; b) multiple multimodal egocentric data streams from Project Aria devices, including videos, eye tracking, and IMUs; and c) a third-person perspective from an additional “observer” device. All devices are precisely synchronized and localized in one metric 3D world. We derive a hierarchical protocol to add in-context language descriptions of human motion, ranging from fine-grained motion narrations to simplified atomic actions and high-level activity summarization. To the best of our knowledge, the Nymeria dataset is the world’s largest collection of human motion in the wild, the first of its kind to provide synchronized and localized multi-device multimodal egocentric data, and the world’s largest motion-language dataset. It provides 300 hours of daily activities from 264 participants across 50 locations, with a total travelling distance of over 399 km. The language descriptions contain 310.5K sentences in 8.64M words, drawn from a vocabulary of 6,545 words. To demonstrate the potential of the dataset, we evaluate several state-of-the-art algorithms for egocentric body tracking, motion synthesis, and action recognition.
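
To make the composition of a recording concrete, the following is a minimal Python sketch of how one multimodal sequence described above could be organized. The class names, field names, and file paths (e.g. `NymeriaSequence`, `narration_tiers`) are hypothetical illustrations, not the dataset's released format or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical illustration of one Nymeria recording as described in the abstract;
# all names, fields, and paths below are assumptions, not the actual data format.
@dataclass
class DeviceStreams:
    """Egocentric streams from one Project Aria device."""
    video_path: str       # egocentric video
    eye_gaze_path: str    # eye-tracking signal
    imu_path: str         # inertial measurements
    trajectory_path: str  # device poses localized in the shared metric 3D world

@dataclass
class NymeriaSequence:
    """One daily-activity recording: motion, synchronized devices, language annotations."""
    body_motion_path: str  # full-body ground-truth motion
    # Synchronized devices, e.g. the participant's glasses and the "observer".
    devices: Dict[str, DeviceStreams] = field(default_factory=dict)
    # Hierarchical in-context descriptions, from fine-grained to coarse.
    narration_tiers: Dict[str, List[str]] = field(default_factory=dict)

seq = NymeriaSequence(
    body_motion_path="body/motion.npz",
    devices={
        "participant": DeviceStreams("ego/video.vrs", "ego/gaze.csv", "ego/imu.csv", "ego/traj.csv"),
        "observer": DeviceStreams("obs/video.vrs", "obs/gaze.csv", "obs/imu.csv", "obs/traj.csv"),
    },
    narration_tiers={
        "motion_narration": ["The person reaches for a mug on the counter."],
        "atomic_action": ["pick up mug"],
        "activity_summary": ["Making coffee in the kitchen."],
    },
)
print(len(seq.devices), "synchronized devices;", list(seq.narration_tiers))
```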

F. Hong, V. Guzov and Y. Jiang—Work done during internships at Meta Reality Labs Research.

Acknowledgements

We gratefully acknowledge the following colleagues for their valuable discussions and technical support. Genesis Mendoza, Jacob Alibadi, Ivan Soeria-Atmadja, Elena Shchetinina, and Atishi Bali worked on data collection. Yusuf Mansour supported gaze estimation on Project Aria. Ahmed Elabbasy, Guru Somasundaram, Omkar Pakhi, and Nikhil Raina supported EgoBlur as the solution for anonymizing video and explored bounding-box annotation. Evgeniy Oleinik, Maien Hamed, and Mark Schwesinger supported onboarding the Nymeria dataset into the Project Aria dataset explorer and the data release. Melissa Hebra helped coordinate narration annotation. Edward Miller served as research program manager. Pierre Moulon provided valuable guidance on open-sourcing the code repository. Tassos Mourikis, Maurizio Monge, David Caruso, Duncan Frost, and Harry Lanaras provided technical support for SLAM. Daniel DeTone, Dan Barnes, Raul Mur Artal, Thomas Whelan, and Austin Kukay provided valuable discussions on annotating semantic bounding boxes. Julian Nubert adopted the dataset for early dogfooding. Pedro Cancel Rivera, Gustavo Solaira, Yang Lou, and Yuyang Zou provided support from the Project Aria program. Svetoslav Kolev provided frequent feedback. Arjang Talattof supported MPS. Gerard Pons-Moll served as senior advisor. Carl Ren and Mingfei Yan served as senior managers.

Author information

Corresponding author

Correspondence to Lingni Ma.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 22,017 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ma, L. et al. (2025). Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15082. Springer, Cham. https://doi.org/10.1007/978-3-031-72691-0_25

  • DOI: https://doi.org/10.1007/978-3-031-72691-0_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72690-3

  • Online ISBN: 978-3-031-72691-0

  • eBook Packages: Computer Science, Computer Science (R0)
