Temporal-Viewpoint Transportation Plan for Skeletal Few-Shot Action Recognition

Wang, Lei; Koniusz, Piotr

doi:10.1007/978-3-031-26316-3_19

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13844)

Included in the following conference series:

Asian Conference on Computer Vision

Abstract

We propose a few-shot learning pipeline for 3D skeleton-based action recognition by Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE). To factor out misalignment between query and support sequences of 3D body joints, we propose an advanced variant of Dynamic Time Warping that jointly models each smooth path between the query and support frames. This simultaneously achieves the best alignment in the temporal and simulated camera viewpoint spaces, and supports end-to-end learning under limited few-shot training data. Sequences are encoded with a temporal block encoder based on Simple Spectral Graph Convolution, a lightweight linear Graph Neural Network backbone. We also include a setting with a transformer. Finally, we propose a similarity-based loss that encourages the alignment of sequences of the same class while preventing the alignment of unrelated sequences. We show state-of-the-art results on NTU-60, NTU-120, Kinetics-skeleton and UWA3D Multiview Activity II.
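To make the idea of aligning jointly over time and viewpoint more concrete, the following is a minimal sketch and not the authors' implementation. It extends the classic Dynamic Time Warping recursion with a third axis for simulated viewpoints. The function name `joint_time_view_dtw`, the layout of the `dist` array, the rule that a path may shift by at most one viewpoint index per step, and the use of a hard minimum are all assumptions made for illustration; an end-to-end trainable version would likely need a smooth minimum, which this sketch omits.

```python
import numpy as np

def joint_time_view_dtw(dist):
    """Illustrative joint temporal-viewpoint alignment (assumed formulation).

    dist[t, s, v] is a precomputed distance between query frame t, taken at
    simulated viewpoint index v, and support frame s. Returns the smallest
    accumulated cost over paths that advance in time like classic DTW and
    change viewpoint index by at most one per step (assumed smoothness rule).
    """
    T, S, V = dist.shape
    acc = np.full((T + 1, S + 1, V), np.inf)
    acc[0, 0, :] = 0.0  # a path may start at any viewpoint
    for t in range(1, T + 1):
        for s in range(1, S + 1):
            for v in range(V):
                # neighbouring viewpoint indices reachable in one step
                lo, hi = max(0, v - 1), min(V, v + 2)
                prev = min(
                    acc[t - 1, s, lo:hi].min(),      # advance query only
                    acc[t, s - 1, lo:hi].min(),      # advance support only
                    acc[t - 1, s - 1, lo:hi].min(),  # advance both
                )
                acc[t, s, v] = dist[t - 1, s - 1, v] + prev
    return acc[T, S, :].min()  # a path may end at any viewpoint

# Toy case: the best match for frame 0 is at viewpoint 0, for frame 1 at viewpoint 1.
toy = np.full((2, 2, 2), 5.0)
toy[0, 0, 0] = 0.0
toy[1, 1, 1] = 0.0
print(joint_time_view_dtw(toy))
```

With a single viewpoint (`V = 1`) the recursion reduces to ordinary DTW. In the toy case, a path confined to one viewpoint cannot avoid a high-cost cell, while a path that is allowed to shift viewpoint can.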

Acknowledgements

We thank Dr. Jun Liu (SUTD) for discussions on FSAR for 3D skeletons, and CSIRO’s Machine Learning and Artificial Intelligence Future Science Platform (MLAI FSP).

Author information

Authors and Affiliations

Australian National University, Canberra, Australia
Lei Wang & Piotr Koniusz
Data61/CSIRO, Sydney, Australia
Lei Wang & Piotr Koniusz

Corresponding author

Correspondence to Piotr Koniusz.

Editor information

Editors and Affiliations

University of Wollongong, Wollongong, NSW, Australia
Lei Wang
University of Bonn, Bonn, Germany
Juergen Gall
University of Adelaide, Adelaide, SA, Australia
Tat-Jun Chin
National Institute of Informatics, Tokyo, Japan
Imari Sato
Johns Hopkins University, Baltimore, MD, USA
Rama Chellappa

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1138 KB)

About this paper

Cite this paper

Wang, L., Koniusz, P. (2023). Temporal-Viewpoint Transportation Plan for Skeletal Few-Shot Action Recognition. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13844. Springer, Cham. https://doi.org/10.1007/978-3-031-26316-3_19

DOI: https://doi.org/10.1007/978-3-031-26316-3_19
Published: 02 March 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26315-6
Online ISBN: 978-3-031-26316-3
eBook Packages: Computer Science, Computer Science (R0)
