Abstract
An action contains rich multi-modal information, yet current methods generally map each action class to a numerical index and use it as the supervisory signal for training. Such numerical labels cannot describe the semantic content of an action. This paper proposes PromptLearner-CLIP for action recognition: the text pathway uses PromptLearner to automatically learn the textual content of the prompt as its input and computes the semantic features of actions, while the vision pathway takes video data as input to learn the visual features of actions. To strengthen the interaction between features of different modalities, this paper proposes a multi-modal information interaction module that uses a Graph Neural Network (GNN) to jointly process the semantic features of the text and the visual features of the video. In addition, the single-modal video classification problem is reformulated as a multi-modal video-text matching problem, and multi-modal contrastive learning is used to reduce the feature distance between samples of the same class but different modalities. Experimental results show that PromptLearner-CLIP can exploit textual semantic information to significantly improve the action recognition performance of various single-modal backbone networks, achieving top-tier results on the Kinetics400, UCF101, and HMDB51 datasets. Code is available at https://github.com/ZhenxingZheng/PromptLearner.
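To make the two core ideas in the abstract concrete, the PyTorch sketch below illustrates (a) a CoOp-style learnable prompt context, in which trainable context vectors are prepended to class-name embeddings in place of a hand-written prompt, and (b) a symmetric video-text contrastive loss that pulls matched video/prompt features together. This is a minimal illustration under assumed dimensions and module names; the encoders, initialisation, and exact loss in the paper may differ, and the class-name embeddings here are random placeholders rather than real tokenizer output.

```python
# Minimal sketch of a learnable prompt context ("PromptLearner") and a
# symmetric video-text contrastive loss. Shapes, names, and initialisation
# are illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptLearner(nn.Module):
    """Learns n_ctx context vectors prepended to each class-name embedding."""

    def __init__(self, num_classes: int, n_ctx: int = 16, dim: int = 512):
        super().__init__()
        # Learnable prompt context, shared across classes.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Placeholder for frozen class-name token embeddings; in a real model
        # these come from the text encoder's tokenizer + embedding table.
        self.register_buffer("cls_emb", torch.randn(num_classes, 1, dim))

    def forward(self) -> torch.Tensor:
        # Returns (num_classes, n_ctx + 1, dim): [ctx_1, ..., ctx_M, CLASS].
        ctx = self.ctx.unsqueeze(0).expand(self.cls_emb.size(0), -1, -1)
        return torch.cat([ctx, self.cls_emb], dim=1)


def video_text_contrastive_loss(video_feat: torch.Tensor,
                                text_feat: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired video/text features."""
    v = F.normalize(video_feat, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    logits = v @ t.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Matched pairs lie on the diagonal: pull them together, push others apart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In the full model, the assembled prompts would be passed through a (typically frozen) CLIP text encoder to yield per-class text features, which are matched against the video pathway's features; the symmetric InfoNCE form above is the standard multi-modal contrastive formulation, not necessarily the paper's exact loss.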
Acknowledgements
This work was supported in part by the National Key Research and Development Program of China under Grant 2021YFE0110500 and in part by the National Natural Science Foundation of China under Grant 62006015 and Grant 62072028.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zheng, Z., An, G., Cao, S., Yang, Z., Ruan, Q. (2023). PromptLearner-CLIP: Contrastive Multi-Modal Action Representation Learning with Context Optimization. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13844. Springer, Cham. https://doi.org/10.1007/978-3-031-26316-3_33
Print ISBN: 978-3-031-26315-6
Online ISBN: 978-3-031-26316-3