Abstract
In this paper, we address scene-aware 3D human motion forecasting. A key challenge in this task is to predict future human motions that are coherent with the scene by modeling human-scene interactions. While recent works have demonstrated that explicit constraints on human-scene interactions can prevent the occurrence of ghost motion, they constrain only part of the human motion, e.g., the global motion of the human or a few joints contacting the scene, leaving the rest of the motion unconstrained. To address this limitation, we propose to represent the human-scene interaction using the mutual distance between the human body and the scene. Such mutual distances constrain both the local and global human motion, so that the prediction is constrained over the whole body. In particular, the mutual distance constraints consist of two components: the signed distance of each vertex on the human mesh to the scene surface, and the distance of basis scene points to the human mesh. We further introduce a global scene representation learned from a signed distance function (SDF) volume to ensure coherence between the global scene representation and the explicit constraint from the mutual distance. We develop a pipeline with two sequential steps: first predicting the future mutual distances, then forecasting the future human motion. We explicitly enforce consistency between predicted poses and mutual distances during training. Extensive experiments on both synthetic and real datasets show that our method consistently outperforms current state-of-the-art techniques.
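The two components of the mutual distance can be illustrated with a minimal sketch. It is not the authors' implementation: the scene is assumed to be a flat floor at z = 0 with an analytic SDF, the body is a toy set of three vertices, and the distance from a basis point to the human mesh is approximated by the distance to the nearest vertex. The names `floor_sdf`, `per_vertex_signed_distance`, and `per_basis_point_distance` are hypothetical.

```python
import numpy as np

def floor_sdf(points):
    """Assumed toy scene: signed distance to a floor plane at z = 0 (positive above)."""
    return points[:, 2]

def per_vertex_signed_distance(vertices, scene_sdf):
    """Signed distance from every body vertex to the scene surface, shape (V,)."""
    return scene_sdf(vertices)

def per_basis_point_distance(basis_points, vertices):
    """Distance from every basis scene point to its nearest body vertex, shape (B,).

    Nearest-vertex distance is an approximation of point-to-mesh distance.
    """
    diff = basis_points[:, None, :] - vertices[None, :, :]  # (B, V, 3)
    return np.linalg.norm(diff, axis=-1).min(axis=1)

# Toy "body": three vertices stacked vertically, the lowest one touching the floor.
vertices = np.array([[0.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0],
                     [0.0, 0.0, 1.7]])

# Toy basis points lying on the floor.
basis_points = np.array([[0.0, 0.0, 0.0],
                         [3.0, 4.0, 0.0]])

vertex_dist = per_vertex_signed_distance(vertices, floor_sdf)
basis_dist = per_basis_point_distance(basis_points, vertices)
```

In this sketch the per-vertex term describes how every part of the body sits relative to the scene, while the per-basis-point term describes where the body is from the scene's point of view, which is why the pair can constrain both local and global motion.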
Notes
1. Unless otherwise specified, we use the column vector convention in this paper.
2. The first GCN layer takes the DCT coefficient matrix as input; thus, \(\textbf{F}^{(1)} = \textbf{H}\).
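As an illustration of Note 2, the following sketch builds a DCT coefficient matrix from toy trajectories and uses it as the input to the first layer. It assumes an orthonormal DCT-II along the time axis and a layout with one row per trajectory; neither the layout nor the sizes `T` and `K` are given in the text above, and `dct_matrix` is a hypothetical helper.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis of shape (n, n); row k is the k-th cosine basis."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis[0] /= np.sqrt(2.0)  # rescale the constant row so that all rows have unit norm
    return basis

T, K = 8, 6                                  # assumed: 8 frames, 6 trajectories
rng = np.random.default_rng(0)
trajectories = rng.standard_normal((K, T))   # toy data, one row per trajectory

C = dct_matrix(T)
H = trajectories @ C.T   # DCT coefficients along time, one row per trajectory
F1 = H                   # input to the first GCN layer, F^(1) = H
recovered = H @ C        # inverse DCT; C is orthogonal, so this restores the input
```

Because the basis is orthonormal, the transform is lossless when all `T` coefficients are kept.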
Acknowledgements
This research was supported in part by the Australian Research Council DECRA Fellowship (DE180100628) and ARC Discovery Grant (DP200102274).
1 Electronic supplementary material
Supplementary material 2 (mp4 65667 KB)
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Xing, C., Mao, W., Liu, M. (2025). Scene-Aware Human Motion Forecasting via Mutual Distance Prediction. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15097. Springer, Cham. https://doi.org/10.1007/978-3-031-72933-1_8
DOI: https://doi.org/10.1007/978-3-031-72933-1_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72932-4
Online ISBN: 978-3-031-72933-1
eBook Packages: Computer Science (R0)