Learning Spatial-Preserved Skeleton Representations for Few-Shot Action Recognition

  • Conference paper
  • In: Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13664)

Abstract

Few-shot action recognition aims to recognize novel action classes from only a few labeled samples and has attracted growing attention due to its practical significance. Human skeletons provide an explainable and data-efficient representation for this problem by explicitly modeling spatial-temporal relations among skeleton joints. However, existing skeleton-based spatial-temporal models tend to degrade the positional distinguishability of joints, which leads to fuzzy spatial matching and poor explainability. To address these issues, we propose a novel spatial matching strategy consisting of spatial disentanglement and spatial activation. The motivation behind spatial disentanglement is our finding that preserving more spatial information for leaf nodes (e.g., the “hand” joint) increases representation diversity for skeleton matching. To achieve spatial disentanglement, we encourage skeletons to be represented in a full-rank space via a rank maximization constraint. Finally, an attention-based spatial activation mechanism is introduced to incorporate the disentanglement by adaptively adjusting the disentangled joints according to matching pairs. Extensive experiments on three skeleton benchmarks demonstrate that the proposed spatial matching strategy can be effectively inserted into existing temporal alignment frameworks, achieving considerable performance improvements as well as inherent explainability.
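To make the two abstract components concrete, below is a minimal PyTorch sketch. It assumes the rank maximization constraint is realized as a nuclear-norm surrogate on the joint-feature matrix (the nuclear norm, i.e. the sum of singular values, is the standard convex surrogate for matrix rank), and it pairs this with a generic scaled dot-product attention over joints as a stand-in for the spatial activation. The function names, shapes, and the exact attention form are illustrative assumptions, not the authors' published formulation.

```python
import torch
import torch.nn.functional as F


def rank_maximization_loss(feats: torch.Tensor) -> torch.Tensor:
    """Surrogate loss that pushes joint embeddings toward a full-rank space.

    feats: (num_joints, dim) joint embeddings for one skeleton sequence.
    The nuclear norm (sum of singular values) is a common convex surrogate
    for matrix rank, so we return its negation as a loss to be minimized.
    """
    return -torch.linalg.svdvals(feats).sum()


def spatial_activation(query: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
    """Hypothetical attention-style spatial activation.

    Re-weights support joint features according to their affinity with the
    query joints, so the matching adapts per query-support pair.
    query, support: (num_joints, dim); returns (num_joints, dim).
    """
    scale = query.shape[-1] ** 0.5
    attn = F.softmax(query @ support.T / scale, dim=-1)  # (num_joints, num_joints)
    return attn @ support


# Illustrative usage: 25 joints (e.g., NTU RGB+D skeletons), 64-dim features.
query_feats = torch.randn(25, 64, requires_grad=True)
support_feats = torch.randn(25, 64)
loss = rank_maximization_loss(query_feats)
activated = spatial_activation(query_feats, support_feats)
loss.backward()
```

In practice such a surrogate would be added to the task loss with a small weight; the value of that weight and how the activated features feed the temporal alignment are design choices the sketch does not pin down.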

H. Zhang: equal contribution with the first author.

Notes

  1. https://pytorch.org/.

  2. https://github.com/NingMa-AI/DASTM.

Acknowledgement

This work was supported by the National Key Research and Development Program (Grant No. 2019YFF0302601), the National Natural Science Foundation of China (Grant Nos. 61972349 and 62106221), and the Multi-Center Clinical Research Project in National Center (No. S20A0002).

Author information

Correspondence to Sheng Zhou or Jiajun Bu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 650 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ma, N. et al. (2022). Learning Spatial-Preserved Skeleton Representations for Few-Shot Action Recognition. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13664. Springer, Cham. https://doi.org/10.1007/978-3-031-19772-7_11

  • DOI: https://doi.org/10.1007/978-3-031-19772-7_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19771-0

  • Online ISBN: 978-3-031-19772-7

  • eBook Packages: Computer Science, Computer Science (R0)
