
PCMG: 3D point cloud human motion generation based on self-attention and transformer

  • Original article
  • Journal: The Visual Computer

Abstract

Previous methods for human motion generation have relied predominantly on skeleton representations to depict human poses, describing a motion as a sequence of skeletons. Such representations, however, cannot directly handle the 3D point cloud sequences obtained from optical motion capture. To address this limitation, we propose point cloud motion generation (PCMG), a novel network that handles both skeleton-based motion representations and point cloud data sampled from the human surface. Trained on a finite set of point cloud sequences, PCMG can generate an unlimited number of new ones: given a predefined action label and shape label as input, it produces a point cloud sequence that reflects the semantics of those labels. PCMG achieves results comparable to state-of-the-art methods for action-conditional human motion generation while outperforming previous approaches in generation efficiency. The code for PCMG will be available at https://github.com/gxucg/PCMG.
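To make the interface described in the abstract concrete, the following minimal PyTorch sketch shows how a label-conditioned point-cloud-sequence generator of this kind might be called. It is not the authors' released implementation: the class name PCMGGenerator, the label vocabularies, the latent dimension, and the frame and point counts are illustrative assumptions; only the input/output contract (action label + shape label, plus a random latent code, mapped to a sequence of 3D point clouds) follows the abstract.

import torch
import torch.nn as nn

class PCMGGenerator(nn.Module):
    # Toy stand-in for a PCMG-style generator (hypothetical names and sizes):
    # maps an action label, a shape label and a random latent code to a
    # sequence of 3D point clouds.
    def __init__(self, num_actions=12, num_shapes=10, latent_dim=256,
                 num_frames=60, num_points=1024):
        super().__init__()
        self.num_frames, self.num_points = num_frames, num_points
        self.action_emb = nn.Embedding(num_actions, latent_dim)
        self.shape_emb = nn.Embedding(num_shapes, latent_dim)
        # Self-attention over per-frame query tokens, conditioned on the labels.
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)
        self.frame_queries = nn.Parameter(torch.randn(num_frames, latent_dim))
        self.to_points = nn.Linear(latent_dim, num_points * 3)

    def forward(self, action_id, shape_id, z=None):
        b = action_id.shape[0]
        if z is None:  # sample a latent code so every call yields a new motion
            z = torch.randn(b, self.frame_queries.shape[1],
                            device=action_id.device)
        cond = self.action_emb(action_id) + self.shape_emb(shape_id) + z   # (B, D)
        tokens = self.frame_queries.unsqueeze(0) + cond.unsqueeze(1)       # (B, T, D)
        feats = self.temporal(tokens)                                      # (B, T, D)
        return self.to_points(feats).view(b, self.num_frames,
                                          self.num_points, 3)              # (B, T, N, 3)

# Example: one sequence for (hypothetical) action label 5 and shape label 3.
generator = PCMGGenerator()
sequence = generator(torch.tensor([5]), torch.tensor([3]))
print(sequence.shape)  # torch.Size([1, 60, 1024, 3])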


Availability of data and materials

The data that support the findings of this study are openly available in ACTOR at https://github.com/Mathux/ACTOR/blob/master/DATASETS.md and Action2Motion at https://ericguo5513.github.io/action-to-motion/.

Code Availability

The code for PCMG will be available at https://github.com/gxucg/PCMG.


Funding

This work was partially supported by the National Natural Science Foundation of China under Grants 61972160 and 62171145.

Author information


Corresponding author

Correspondence to Mengxiao Yin.

Ethics declarations

Conflict of interest

The authors declare that they have no known financial or personal conflicts of interest that could have influenced the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (mp4 49815 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ma, W., Yin, M., Li, G. et al. PCMG: 3D point cloud human motion generation based on self-attention and transformer. Vis Comput 40, 3765–3780 (2024). https://doi.org/10.1007/s00371-023-03063-x

