Abstract
Previous methods for human motion generation have predominantly relied on skeleton representations of human pose, depicting a motion as a sequence of skeleton poses. However, such representations are not directly suitable for the 3D point cloud sequences produced by optical motion capture. To address this limitation, we propose a novel network, point cloud motion generation (PCMG), that handles both skeleton-based motion representations and point cloud data sampled from the human body surface. Trained on a finite set of point cloud sequences, PCMG can generate an unlimited number of new ones: given a predefined action label and a shape label as input, it produces a point cloud sequence that captures the semantics of both labels. PCMG achieves results comparable to state-of-the-art methods for action-conditional human motion generation while outperforming previous approaches in generation efficiency. The code for PCMG will be available at https://github.com/gxucg/PCMG.
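The paper's implementation is not reproduced here; as a rough illustration of the conditional interface the abstract describes (latent code plus action and shape labels in, point cloud sequence out), the following minimal PyTorch sketch decodes per-frame queries with a small transformer. All class names, layer sizes, and label vocabularies are illustrative assumptions, not PCMG's actual architecture.

# Hypothetical sketch of a label-conditioned point cloud sequence generator.
# Names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

class ConditionalPointCloudGenerator(nn.Module):
    def __init__(self, num_actions=12, num_shapes=4, latent_dim=256,
                 num_frames=60, num_points=1024):
        super().__init__()
        self.num_frames = num_frames
        self.num_points = num_points
        # Embed the action and shape labels and fuse them with the latent code.
        self.action_emb = nn.Embedding(num_actions, latent_dim)
        self.shape_emb = nn.Embedding(num_shapes, latent_dim)
        # Learned per-frame queries refined by a self-attention stack.
        self.frame_queries = nn.Parameter(torch.randn(num_frames, latent_dim))
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)
        # Project each frame feature to num_points xyz coordinates.
        self.to_points = nn.Linear(latent_dim, num_points * 3)

    def forward(self, z, action, shape):
        # z: (B, latent_dim) latent sample; action, shape: (B,) label indices.
        cond = z + self.action_emb(action) + self.shape_emb(shape)    # (B, D)
        tokens = self.frame_queries.unsqueeze(0) + cond.unsqueeze(1)  # (B, T, D)
        feats = self.temporal(tokens)                                 # (B, T, D)
        pts = self.to_points(feats)                                   # (B, T, N*3)
        return pts.view(-1, self.num_frames, self.num_points, 3)

# Usage: sample one latent and generate a sequence for action 3, shape 1.
gen = ConditionalPointCloudGenerator()
z = torch.randn(1, 256)
seq = gen(z, torch.tensor([3]), torch.tensor([1]))
print(seq.shape)  # torch.Size([1, 60, 1024, 3])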
Availability of data and materials
The data that support the findings of this study are openly available in ACTOR at https://github.com/Mathux/ACTOR/blob/master/DATASETS.md and Action2Motion at https://ericguo5513.github.io/action-to-motion/.
Code Availability
The code for PCMG will be available at https://github.com/gxucg/PCMG.
Funding
This work was supported in part by the National Natural Science Foundation of China under Grants 61972160 and 62171145.
Ethics declarations
Conflict of interest
The authors declare that they have no known financial or personal conflicts of interest that could have influenced the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary file 1 (mp4 49815 KB)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ma, W., Yin, M., Li, G. et al. PCMG: 3D point cloud human motion generation based on self-attention and transformer. Vis Comput 40, 3765–3780 (2024). https://doi.org/10.1007/s00371-023-03063-x