Skip to main content

Learning by Aligning 2D Skeleton Sequences and Multi-modality Fusion

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

This paper presents a self-supervised temporal video alignment framework which is useful for several fine-grained human activity understanding applications. In contrast with the state-of-the-art method of CASA, where sequences of 3D skeleton coordinates are taken directly as input, our key idea is to use sequences of 2D skeleton heatmaps as input. Unlike CASA which performs self-attention in the temporal domain only, we feed 2D skeleton heatmaps to a video transformer which performs self-attention both in the spatial and temporal domains for extracting effective spatiotemporal and contextual features. In addition, we introduce simple heatmap augmentation techniques based on 2D skeletons for self-supervised learning. Despite the lack of 3D information, our approach achieves not only higher accuracy but also better robustness against missing and noisy keypoints than CASA. Furthermore, extensive evaluations on three public datasets, i.e., Penn Action, IKEA ASM, and H2O, demonstrate that our approach outperforms previous methods in different fine-grained human activity understanding tasks. Finally, fusing 2D skeleton heatmaps with RGB videos yields the state-of-the-art on all metrics and datasets. To our best knowledge, our work is the first to utilize 2D skeleton heatmap inputs and the first to explore multi-modality fusion for temporal video alignment.

Q.-H. Tran and M. Ahmed—Indicates joint first author.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

References

  1. Ahsan, U., Sun, C., Essa, I.: Discrimnet: Semi-supervised action recognition from videos using generative adversarial networks. arXiv preprint (2018)

    Google Scholar 

  2. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: Vivit: video vision transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6836–6846 (2021)a

    Google Scholar 

  3. Asghari-Esfeden, S., Sznaier, M., Camps, O.: Dynamic motion representation for human action recognition. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 557–566 (2020)

    Google Scholar 

  4. Ben-Shabat, Y., et al.: The ikea asm dataset: Understanding people assembling furniture through actions, objects, and pose. In: arXiv preprint (2020)

    Google Scholar 

  5. Benaim, S., et al.: Speednet: learning the speediness in videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9922–9931 (2020)

    Google Scholar 

  6. Caetano, C., Sena, J., Brémond, F., Dos Santos, J.A., Schwartz, W.R.: Skelemotion: A new representation of skeleton joint sequences based on motion information for 3d action recognition. In: 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–8. IEEE (2019)

    Google Scholar 

  7. Cai, J., Jiang, N., Han, X., Jia, K., Lu, J.: Jolo-gcn: mining joint-centered light-weight information for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Winter Conference on Applications Of Computer Vision, pp. 2735–2744 (2021)

    Google Scholar 

  8. Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: Openpose: realtime multi-person 2d pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 43(1), 172–186 (2019)

    Article  Google Scholar 

  9. Carlucci, F.M., D’Innocente, A., Bucci, S., Caputo, B., Tommasi, T.: Domain generalization by solving jigsaw puzzles. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2229–2238 (2019)

    Google Scholar 

  10. Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning of visual features. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149 (2018)

    Google Scholar 

  11. Caron, M., Bojanowski, P., Mairal, J., Joulin, A.: Unsupervised pre-training of image features on non-curated data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2959–2968 (2019)

    Google Scholar 

  12. Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural. Inf. Process. Syst. 33, 9912–9924 (2020)

    Google Scholar 

  13. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)

    Google Scholar 

  14. Chen, X., He, K.: Exploring simple siamese representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758 (2021)

    Google Scholar 

  15. Chen, Y., Zhang, Z., Yuan, C., Li, B., Deng, Y., Hu, W.: Channel-wise topology refinement graph convolution for skeleton-based action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13359–13368 (2021)

    Google Scholar 

  16. p Choutas, V., Weinzaepfel, P., Revaud, J., Schmid, C.: Potion: pose motion representation for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7024–7033 (2018)

    Google Scholar 

  17. Cuturi, M.: Sinkhorn distances: Lightspeed computation of optimal transport. In: Advances in Neural Information Processing Systems, vol. 26 (2013)

    Google Scholar 

  18. Cuturi, M., Blondel, M.: Soft-dtw: a differentiable loss function for time-series. In: International Conference on Machine Learning, pp. 894–903. PMLR (2017)

    Google Scholar 

  19. Das, S., Sharma, S., Dai, R., Brémond, F., Thonnat, M.: VPN: learning video-pose embedding for activities of daily living. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX, pp. 72–90. Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-58545-7_5

    Chapter  Google Scholar 

  20. Dave, I., Gupta, R., Rizve, M.N., Shah, M.: Tclr: temporal contrastive learning for video representation. Comput. Vis. Image Underst. 219, 103406 (2022)

    Article  MATH  Google Scholar 

  21. Diba, A., Sharma, V., Gool, L.V., Stiefelhagen, R.: Dynamonet: Dynamic action and motion network. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 6192–6201 (2019)

    Google Scholar 

  22. Duan, H., Zhao, Y., Chen, K., Lin, D., Dai, B.: Revisiting skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2969–2978 (2022)

    Google Scholar 

  23. Dwibedi, D., Aytar, Y., Tompson, J., Sermanet, P., Zisserman, A.: Temporal cycle-consistency learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)

    Google Scholar 

  24. Feichtenhofer, C., Fan, H., Xiong, B., Girshick, R., He, K.: A large-scale study on unsupervised spatiotemporal representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3299–3309 (2021)

    Google Scholar 

  25. Fernando, B., Bilen, H., Gavves, E., Gould, S.: Self-supervised video representation learning with odd-one-out networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3636–3645 (2017)

    Google Scholar 

  26. Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=S1v4N2l0-

  27. Goroshin, R., Bruna, J., Tompson, J., Eigen, D., LeCun, Y.: Unsupervised learning of spatiotemporally coherent metrics. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4086–4093 (2015)

    Google Scholar 

  28. Grill, J.B., et al.: Bootstrap your own latent-a new approach to self-supervised learning. Adv. Neural. Inf. Process. Syst. 33, 21271–21284 (2020)

    MATH  Google Scholar 

  29. Gupta, P., et al.: Quo vadis, skeleton action recognition? Int. J. Comput. Vision 129(7), 2097–2112 (2021)

    Article  MATH  Google Scholar 

  30. Haresh, S., et al.: Learning by aligning videos in time. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5548–5558 (2021)

    Google Scholar 

  31. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

    Google Scholar 

  32. Hernandez Ruiz, A., Porzi, L., Rota Bulò, S., Moreno-Noguer, F.: 3d cnns on distance matrices for human action recognition. In: Proceedings of the 25th ACM International Conference on Multimedia, pp. 1087–1095 (2017)

    Google Scholar 

  33. Hu, K., Shao, J., Liu, Y., Raj, B., Savvides, M., Shen, Z.: Contrast and order representations for video self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7939–7949 (2021)

    Google Scholar 

  34. Hyder, S.W., et al.: Action segmentation using 2d skeleton heatmaps and multi-modality fusion. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2024)

    Google Scholar 

  35. Jenni, S., Jin, H., Favaro, P.: Steering self-supervised feature learning beyond local pixel statistics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6408–6417 (2020)

    Google Scholar 

  36. Kay, W., et al.: The kinetics human action video dataset. arXiv (2017). https://doi.org/10.48550/ARXIV.1705.06950, https://arxiv.org/abs/1705.06950

  37. Ke, Q., Bennamoun, M., An, S., Sohel, F., Boussaid, F.: A new representation of skeleton sequences for 3d action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3288–3297 (2017)

    Google Scholar 

  38. Khan, H., et al.: Timestamp-supervised action segmentation with graph convolutional networks. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 10619–10626. IEEE (2022)

    Google Scholar 

  39. Kim, D., Cho, D., Kweon, I.S.: Self-supervised video representation learning with space-time cubic puzzles. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 8545–8552 (2019)

    Google Scholar 

  40. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint (2014)

    Google Scholar 

  41. Kumar, S., Haresh, S., Ahmed, A., Konin, A., Zia, M.Z., Tran, Q.H.: Unsupervised action segmentation by joint representation learning and online clustering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20174–20185 (2022)

    Google Scholar 

  42. Kwon, T., Tekin, B., Stühmer, J., Bogo, F., Pollefeys, M.: H2o: Two hands manipulating objects for first person interaction recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10138–10148 (October 2021)

    Google Scholar 

  43. Kwon, T., Tekin, B., Tang, S., Pollefeys, M.: Context-aware sequence alignment using 4d skeletal augmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8172–8182 (2022)

    Google Scholar 

  44. Larsson, G., Maire, M., Shakhnarovich, G.: Learning representations for automatic colorization. In: European Conference on Computer Vision, pp. 577–593. Springer (2016). https://doi.org/10.1007/978-3-319-46493-0_35

  45. Larsson, G., Maire, M., Shakhnarovich, G.: Colorization as a proxy task for visual understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6874–6883 (2017)

    Google Scholar 

  46. Lee, H.Y., Huang, J.B., Singh, M., Yang, M.H.: Unsupervised representation learning by sorting sequences. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 667–676 (2017)

    Google Scholar 

  47. Li, C., Zhong, Q., Xie, D., Pu, S.: Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv preprint (2018)

    Google Scholar 

  48. Lin, L., Song, S., Yang, W., Liu, J.: Ms2l: multi-task self-supervised learning for skeleton based action recognition. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 2490–2498 (2020)

    Google Scholar 

  49. Lin, Z., Zhang, W., Deng, X., Ma, C., Wang, H.: Image-based pose representation for action recognition and hand gesture recognition. In: 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 532–539. IEEE (2020)

    Google Scholar 

  50. Liu, W., Tekin, B., Coskun, H., Vineet, V., Fua, P., Pollefeys, M.: Learning to align sequential actions in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2181–2191 (2022)

    Google Scholar 

  51. Liu, X., Van De Weijer, J., Bagdanov, A.D.: Leveraging unlabeled data for crowd counting by learning to rank. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7661–7669 (2018)

    Google Scholar 

  52. Luvizon, D.C., Picard, D., Tabia, H.: 2d/3d pose estimation and action recognition using multitask deep learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5137–5146 (2018)

    Google Scholar 

  53. Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: unsupervised learning using temporal order verification. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I, pp. 527–544. Springer International Publishing, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_32

    Chapter  Google Scholar 

  54. Mobahi, H., Collobert, R., Weston, J.: Deep learning from temporal coherence in video. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 737–744 (2009)

    Google Scholar 

  55. Noroozi, M., Pirsiavash, H., Favaro, P.: Representation learning by learning to count. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5898–5906 (2017)

    Google Scholar 

  56. Paszke, A., et al.: Automatic differentiation in pytorch (2017)

    Google Scholar 

  57. Pickup, L.C., et al.: Seeing the arrow of time. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2035–2042 (2014)

    Google Scholar 

  58. Qian, R., et al.: Spatiotemporal contrastive video representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6964–6974 (2021)

    Google Scholar 

  59. Sermanet, P., et al.: Time-contrastive networks: self-supervised learning from video. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1134–1141 (2018). https://doi.org/10.1109/ICRA.2018.8462891

  60. Shah, A., Lundell, B., Sawhney, H., Chellappa, R.: Steps: self-supervised key step extraction and localization from unlabeled procedural videos. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10375–10387 (2023)

    Google Scholar 

  61. Si, C., Nie, X., Wang, W., Wang, L., Tan, T., Feng, J.: Adversarial self-supervised learning for semi-supervised 3d action recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII, pp. 35–51. Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-58571-6_3

    Chapter  Google Scholar 

  62. Song, Y.F., Zhang, Z., Shan, C., Wang, L.: Stronger, faster and more explainable: a graph convolutional baseline for skeleton-based action recognition. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 1625–1633 (2020)

    Google Scholar 

  63. Song, Y.F., Zhang, Z., Shan, C., Wang, L.: Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 45(2), 1474–1488 (2022)

    Article  MATH  Google Scholar 

  64. Srivastava, N., Mansimov, E., Salakhudinov, R.: Unsupervised learning of video representations using lstms. In: International conference on machine learning, pp. 843–852 (2015)

    Google Scholar 

  65. Su, K., Liu, X., Shlizerman, E.: Predict & cluster: Unsupervised skeleton based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9631–9640 (2020)

    Google Scholar 

  66. Su, Y., Lin, G., Wu, Q.: Self-supervised 3d skeleton action representation learning with motion consistency and continuity. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13328–13338 (2021)

    Google Scholar 

  67. Sun, J., Shen, Z., Wang, Y., Bao, H., Zhou, X.: Loftr: detector-free local feature matching with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8922–8931 (2021)

    Google Scholar 

  68. Tran, Q.H., et al.: Permutation-aware activity segmentation via unsupervised frame-to-segment alignment. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 6426–6436 (2024)

    Google Scholar 

  69. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)

    Google Scholar 

  70. Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics. In: Advances in Neural Information Processing Systems, pp. 613–621 (2016)

    Google Scholar 

  71. Wang, J., Jiao, J., Liu, Y.-H.: Self-supervised video representation learning by pace prediction. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII, pp. 504–521. Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-58520-4_30

    Chapter  Google Scholar 

  72. Wei, D., Lim, J.J., Zisserman, A., Freeman, W.T.: Learning and using the arrow of time. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8052–8060 (2018)

    Google Scholar 

  73. Xu, D., Xiao, J., Zhao, Z., Shao, J., Xie, D., Zhuang, Y.: Self-supervised spatiotemporal learning via video clip order prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10334–10343 (2019)

    Google Scholar 

  74. Yan, A., Wang, Y., Li, Z., Qiao, Y.: Pa3d: pose-action 3d machine for video recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7922–7931 (2019)

    Google Scholar 

  75. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-Second AAAI Conference On Artificial Intelligence (2018)

    Google Scholar 

  76. Yao, Y., Liu, C., Luo, D., Zhou, Y., Ye, Q.: Video playback rate perception for self-supervised spatio-temporal representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6548–6557 (2020)

    Google Scholar 

  77. Zhang, H., Liu, D., Zheng, Q., Su, B.: Modeling video as stochastic processes for fine-grained video representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2225–2234 (2023)

    Google Scholar 

  78. Zhang, W., Zhu, M., Derpanis, K.G.: From actemes to action: a strongly-supervised representation for detailed action understanding. In: 2013 IEEE International Conference on Computer Vision, pp. 2248–2255 (2013). https://doi.org/10.1109/ICCV.2013.280

  79. Zheng, N., Wen, J., Liu, R., Long, L., Dai, J., Gong, Z.: Unsupervised representation learning with long-term dynamics for skeleton based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)

    Google Scholar 

  80. Zhu, D., Zhang, Z., Cui, P., Zhu, W.: Robust graph convolutional networks against adversarial attacks. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and data mining, pp. 1399–1407 (2019)

    Google Scholar 

  81. Zou, W.Y., Ng, A.Y., Yu, K.: Unsupervised learning of visual invariance with temporal coherence. In: NIPS 2011 Workshop on Deep Learning and Unsupervised Feature Learning, vol. 3 (2011)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Quoc-Huy Tran .

Editor information

Editors and Affiliations

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 6407 KB)

Rights and permissions

Reprints and permissions

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Tran, QH., Ahmed, M., Popattia, M., Ahmed, M.H., Konin, A., Zia, M.Z. (2025). Learning by Aligning 2D Skeleton Sequences and Multi-modality Fusion. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15108. Springer, Cham. https://doi.org/10.1007/978-3-031-72973-7_9

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72973-7_9

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72972-0

  • Online ISBN: 978-3-031-72973-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics