Abstract
Compositional actions combine dynamic concepts (verbs) with static concepts (objects). Humans easily recognize unseen compositions of concepts they have already learned. For machines, this requires recognizing unseen actions composed of previously observed verbs and objects, i.e., compositional generalization. To facilitate this line of research, we propose a novel Zero-Shot Compositional Action Recognition (ZS-CAR) task. To evaluate the task, we construct a new benchmark, Something-composition (Sth-com), based on the widely used Something-Something V2 dataset. We also propose a novel Component-to-Composition (C2C) learning method for ZS-CAR. C2C comprises an independent component learning module and a composition inference module. Finally, we devise an enhanced training strategy that addresses component variation between seen and unseen compositions and balances learning seen and unseen actions. Experimental results demonstrate that the proposed framework significantly surpasses existing compositional generalization methods and sets a new state of the art. The new Sth-com benchmark and code are available at https://github.com/RongchangLi/ZSCAR_C2C.
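To make the task concrete, the following is a minimal, self-contained sketch of the zero-shot compositional inference idea the abstract describes: score a video for every verb and every object independently, then combine the component scores to rank candidate (verb, object) compositions, including pairs never seen during training. All names, the toy label sets, the random stand-in scores, and the product combination rule are illustrative assumptions; they do not reproduce the authors' C2C modules or training strategy.

import numpy as np

verbs = ["push", "lift", "drop"]          # hypothetical verb vocabulary
objects = ["cup", "book"]                 # hypothetical object vocabulary
seen = {("push", "cup"), ("lift", "book")}  # made-up seen/unseen split
candidates = [(v, o) for v in verbs for o in objects]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-ins for the outputs of independently learned verb/object heads;
# a real model would produce these from video features.
rng = np.random.default_rng(0)
p_verb = softmax(rng.normal(size=len(verbs)))
p_obj = softmax(rng.normal(size=len(objects)))

# Compose component probabilities into composition scores (simple product).
scores = {
    (v, o): p_verb[verbs.index(v)] * p_obj[objects.index(o)]
    for (v, o) in candidates
}
best = max(scores, key=scores.get)
print("predicted composition:", best,
      "(unseen)" if best not in seen else "(seen)")

Because the composition score factorizes over components, any (verb, object) pair can be ranked at test time, which is what lets such a model generalize to unseen compositions.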
Acknowledgements
This work was supported in part by the National Key Research and Development Program of China (2023YFF1105102, 2023YFF1105105), the National Natural Science Foundation of China (62020106012, 62332008, 62106089, U1836218, 62336004), the 111 Project of Ministry of Education of China (B12018), the Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX22_2307), and the UK EPSRC (EP/V002856/1, EP/T022205/1).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, R. et al. (2025). C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15096. Springer, Cham. https://doi.org/10.1007/978-3-031-72920-1_21
DOI: https://doi.org/10.1007/978-3-031-72920-1_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72919-5
Online ISBN: 978-3-031-72920-1