PromptLearner-CLIP: Contrastive Multi-Modal Action Representation Learning with Context Optimization

Zheng, Zhenxing; An, Gaoyun; Cao, Shan; Yang, Zhaoqilin; Ruan, Qiuqi

doi:10.1007/978-3-031-26316-3_33

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13844)

Included in the following conference series:

Asian Conference on Computer Vision

Abstract

An action contains rich multi-modal information, yet current methods generally map each action class to a numerical label that serves as the supervision signal for training. Such numerical labels cannot describe the semantic content of the action. This paper proposes PromptLearner-CLIP for action recognition. The text pathway uses PromptLearner to automatically learn the text content of the prompt as its input and computes the semantic features of actions; the vision pathway takes video data as input and learns the visual features of actions. To strengthen the interaction between features of different modalities, this paper proposes a multi-modal information interaction module that uses a Graph Neural Network (GNN) to process both the semantic features of the text content and the visual features of a video. In addition, the single-modal video classification problem is transformed into a multi-modal video-text matching problem, and multi-modal contrastive learning is used to reduce the feature distance between samples of the same class from different modalities. Experimental results show that PromptLearner-CLIP can use textual semantic information to significantly improve the performance of various single-modal backbone networks on action recognition, and that it achieves top-tier results on the Kinetics400, UCF101, and HMDB51 datasets. Code is available at https://github.com/ZhenxingZheng/PromptLearner.
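
The reformulation of classification as video-text matching can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy three-dimensional features, the temperature of 0.07, and the use of a plain video-to-text cross-entropy as the contrastive objective are all assumptions made for illustration. In the paper the features would come from the vision pathway and from PromptLearner's text pathway; here they are hard-coded.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(video_feats, text_feats, temperature=0.07):
    """Return a (num_videos x num_classes) matrix of temperature-scaled cosine similarities.

    The temperature value is an illustrative assumption.
    """
    v = [l2_normalize(f) for f in video_feats]
    t = [l2_normalize(f) for f in text_feats]
    return [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def matching_loss(logits, labels):
    """Video-to-text cross-entropy: pull each video toward the text feature of its class."""
    return -sum(log_softmax(row)[y] for row, y in zip(logits, labels)) / len(labels)

def predict(logits):
    """Classify each video as the class whose text feature matches it best."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# Hypothetical text features, one per action class (stand-ins for text-pathway outputs).
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical video features for two clips (stand-ins for vision-pathway outputs).
video_feats = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.8]]
labels = [0, 2]

logits = cosine_logits(video_feats, text_feats)
loss = matching_loss(logits, labels)
predictions = predict(logits)
```

Under this formulation, inference needs no classification head: a video is assigned to the class whose text feature is closest to it, so the class names (or learned prompts) carry the semantics that a numerical label discards.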

Acknowledgements

This work was supported in part by the National Key Research and Development Program of China under Grant 2021YFE0110500 and in part by the National Natural Science Foundation of China under Grant 62006015 and Grant 62072028.

Author information

Authors and Affiliations

School of Communications and Information Engineering, Xi’an University of Posts and Telecommunications, Xi’an, 710121, China
Zhenxing Zheng
Institute of Information Science, Beijing Jiaotong University, Beijing, 100044, China
Zhenxing Zheng, Gaoyun An, Shan Cao, Zhaoqilin Yang & Qiuqi Ruan
Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing, 100044, China
Zhenxing Zheng, Gaoyun An, Shan Cao, Zhaoqilin Yang & Qiuqi Ruan

Corresponding author

Correspondence to Gaoyun An.

Editor information

Editors and Affiliations

University of Wollongong, Wollongong, NSW, Australia
Lei Wang
University of Bonn, Bonn, Germany
Juergen Gall
University of Adelaide, Adelaide, SA, Australia
Tat-Jun Chin
National Institute of Informatics, Tokyo, Japan
Imari Sato
Johns Hopkins University, Baltimore, MD, USA
Rama Chellappa

About this paper

Cite this paper

Zheng, Z., An, G., Cao, S., Yang, Z., Ruan, Q. (2023). PromptLearner-CLIP: Contrastive Multi-Modal Action Representation Learning with Context Optimization. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13844. Springer, Cham. https://doi.org/10.1007/978-3-031-26316-3_33

DOI: https://doi.org/10.1007/978-3-031-26316-3_33
Published: 02 March 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26315-6
Online ISBN: 978-3-031-26316-3
eBook Packages: Computer Science, Computer Science (R0)
