Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

We introduce Nymeria - a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices. The dataset comes with a) full-body ground-truth motion; b) multiple multimodal egocentric data streams from Project Aria devices, including videos, eye tracking, and IMUs; and c) a third-person perspective from an additional “observer” device. All devices are precisely synchronized and localized in one metric 3D world. We derive a hierarchical protocol to add in-context language descriptions of human motion, ranging from fine-grained motion narrations to simplified atomic actions and high-level activity summarization. To the best of our knowledge, the Nymeria dataset is the world’s largest collection of human motion in the wild, the first of its kind to provide synchronized and localized multi-device multimodal egocentric data, and the world’s largest motion-language dataset. It provides 300 hours of daily activities from 264 participants across 50 locations, with a total travelling distance of over 399 km. The language descriptions contain 310.5K sentences in 8.64M words, drawn from a vocabulary of 6,545 words. To demonstrate the potential of the dataset, we evaluate several state-of-the-art algorithms for egocentric body tracking, motion synthesis, and action recognition.
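
To make the composition of a recording concrete, the following is a minimal Python sketch of how one multimodal sequence described above could be organized. The class names, field names, and file paths (e.g. `NymeriaSequence`, `narration_tiers`) are hypothetical illustrations, not the dataset's released format or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical illustration of one Nymeria recording as described in the abstract;
# all names, fields, and paths below are assumptions, not the actual data format.
@dataclass
class DeviceStreams:
    """Egocentric streams from one Project Aria device."""
    video_path: str       # egocentric video
    eye_gaze_path: str    # eye-tracking signal
    imu_path: str         # inertial measurements
    trajectory_path: str  # device poses localized in the shared metric 3D world

@dataclass
class NymeriaSequence:
    """One daily-activity recording: motion, synchronized devices, language annotations."""
    body_motion_path: str  # full-body ground-truth motion
    # Synchronized devices, e.g. the participant's glasses and the "observer".
    devices: Dict[str, DeviceStreams] = field(default_factory=dict)
    # Hierarchical in-context descriptions, from fine-grained to coarse.
    narration_tiers: Dict[str, List[str]] = field(default_factory=dict)

seq = NymeriaSequence(
    body_motion_path="body/motion.npz",
    devices={
        "participant": DeviceStreams("ego/video.vrs", "ego/gaze.csv", "ego/imu.csv", "ego/traj.csv"),
        "observer": DeviceStreams("obs/video.vrs", "obs/gaze.csv", "obs/imu.csv", "obs/traj.csv"),
    },
    narration_tiers={
        "motion_narration": ["The person reaches for a mug on the counter."],
        "atomic_action": ["pick up mug"],
        "activity_summary": ["Making coffee in the kitchen."],
    },
)
print(len(seq.devices), "synchronized devices;", list(seq.narration_tiers))
```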

F. Hong, V. Guzov and Y. Jiang—Work done during internships at Meta Reality Labs Research.

Acknowledgements

We gratefully acknowledge the following colleagues for their valuable discussions and technical support. Genesis Mendoza, Jacob Alibadi, Ivan Soeria-Atmadja, Elena Shchetinina, and Atishi Bali worked on data collection. Yusuf Mansour supported gaze estimation on Project Aria. Ahmed Elabbasy, Guru Somasundaram, Omkar Pakhi, and Nikhil Raina supported EgoBlur as the solution for anonymizing video and explored bounding-box annotation. Evgeniy Oleinik, Maien Hamed, and Mark Schwesinger supported onboarding the Nymeria dataset into the Project Aria dataset explorer and the data release. Melissa Hebra helped coordinate narration annotation. Edward Miller served as research program manager. Pierre Moulon provided valuable guidance on open-sourcing the code repository. Tassos Mourikis, Maurizio Monge, David Caruso, Duncan Frost, and Harry Lanaras provided technical support for SLAM. Daniel DeTone, Dan Barnes, Raul Mur Artal, Thomas Whelan, and Austin Kukay provided valuable discussions on annotating semantic bounding boxes. Julian Nubert adopted the dataset for early dogfooding. Pedro Cancel Rivera, Gustavo Solaira, Yang Lou, and Yuyang Zou provided support from the Project Aria program. Svetoslav Kolev provided frequent feedback. Arjang Talattof supported MPS. Gerard Pons-Moll served as senior advisor. Carl Ren and Mingfei Yan served as senior managers.

Author information

Corresponding author

Correspondence to Lingni Ma.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 22,017 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ma, L. et al. (2025). Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15082. Springer, Cham. https://doi.org/10.1007/978-3-031-72691-0_25

  • DOI: https://doi.org/10.1007/978-3-031-72691-0_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72690-3

  • Online ISBN: 978-3-031-72691-0

  • eBook Packages: Computer Science, Computer Science (R0)
