
PCMG: 3D point cloud human motion generation based on self-attention and transformer

  • Original article
  • Journal: The Visual Computer

Abstract

Previous methods for human motion generation have relied predominantly on skeleton representations to depict human poses, describing a motion as a sequence of skeletons. Such representations, however, cannot directly handle the 3D point cloud sequences obtained from optical motion capture. To address this limitation, we propose point cloud motion generation (PCMG), a novel network that handles both skeleton-based motion representations and point cloud data sampled from the human surface. Trained on a finite set of point cloud sequences, PCMG can generate an unlimited number of new ones: given a predefined action label and shape label as input, it produces a point cloud sequence that reflects the semantics of those labels. PCMG achieves results comparable to state-of-the-art methods for action-conditional human motion generation while outperforming previous approaches in generation efficiency. The code for PCMG will be available at https://github.com/gxucg/PCMG.
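To make the interface described in the abstract concrete, the following minimal PyTorch sketch shows how a label-conditioned point-cloud-sequence generator of this kind might be called. It is not the authors' released implementation: the class name PCMGGenerator, the label vocabularies, the latent dimension, and the frame and point counts are illustrative assumptions; only the input/output contract (action label + shape label, plus a random latent code, mapped to a sequence of 3D point clouds) follows the abstract.

import torch
import torch.nn as nn

class PCMGGenerator(nn.Module):
    # Toy stand-in for a PCMG-style generator (hypothetical names and sizes):
    # maps an action label, a shape label and a random latent code to a
    # sequence of 3D point clouds.
    def __init__(self, num_actions=12, num_shapes=10, latent_dim=256,
                 num_frames=60, num_points=1024):
        super().__init__()
        self.num_frames, self.num_points = num_frames, num_points
        self.action_emb = nn.Embedding(num_actions, latent_dim)
        self.shape_emb = nn.Embedding(num_shapes, latent_dim)
        # Self-attention over per-frame query tokens, conditioned on the labels.
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)
        self.frame_queries = nn.Parameter(torch.randn(num_frames, latent_dim))
        self.to_points = nn.Linear(latent_dim, num_points * 3)

    def forward(self, action_id, shape_id, z=None):
        b = action_id.shape[0]
        if z is None:  # sample a latent code so every call yields a new motion
            z = torch.randn(b, self.frame_queries.shape[1],
                            device=action_id.device)
        cond = self.action_emb(action_id) + self.shape_emb(shape_id) + z   # (B, D)
        tokens = self.frame_queries.unsqueeze(0) + cond.unsqueeze(1)       # (B, T, D)
        feats = self.temporal(tokens)                                      # (B, T, D)
        return self.to_points(feats).view(b, self.num_frames,
                                          self.num_points, 3)              # (B, T, N, 3)

# Example: one sequence for (hypothetical) action label 5 and shape label 3.
generator = PCMGGenerator()
sequence = generator(torch.tensor([5]), torch.tensor([3]))
print(sequence.shape)  # torch.Size([1, 60, 1024, 3])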


Availability of data and materials

The data that support the findings of this study are openly available in ACTOR at https://github.com/Mathux/ACTOR/blob/master/DATASETS.md and Action2Motion at https://ericguo5513.github.io/action-to-motion/.

Code Availability

The code for PCMG will be available at https://github.com/gxucg/PCMG.


Funding

This work was partially supported by the National Natural Science Foundation of China under Grants 61972160 and 62171145.

Author information


Corresponding author

Correspondence to Mengxiao Yin.

Ethics declarations

Conflict of interest

The authors declare that they have no known financial or personal conflicts of interest that could have influenced the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (mp4 49815 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ma, W., Yin, M., Li, G. et al. PCMG: 3D point cloud human motion generation based on self-attention and transformer. Vis Comput 40, 3765–3780 (2024). https://doi.org/10.1007/s00371-023-03063-x

