Abstract
Zero-Shot Action Recognition (ZSAR) aims to recognize video actions that have never been seen during training. Most existing methods assume a shared semantic space between seen and unseen actions and attempt to learn a direct mapping from the visual space to the semantic space. This approach is challenged by the semantic gap between the two spaces. This paper presents a novel method that uses object semantics as privileged information to narrow the semantic gap and thereby effectively assist the learning. In particular, a simple hallucination network is proposed to implicitly extract object semantics at test time without explicitly detecting objects, and a cross-attention module is developed to augment the visual features with these object semantics. Experiments on the Olympic Sports, HMDB51, and UCF101 datasets show that the proposed method outperforms state-of-the-art methods by a large margin.
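To make the two components concrete, the sketch below illustrates one plausible PyTorch implementation of a hallucination network and a cross-attention augmentation module. Everything here (the module names HallucinationNet and CrossAttentionAugment, the 300-d object-semantic space, feature dimensions, and the residual fusion) is an illustrative assumption, not the authors' exact architecture.

# A minimal PyTorch sketch of the two components named in the abstract.
# All names, dimensions, and wiring are illustrative assumptions, not the
# authors' exact design.
import torch
import torch.nn as nn

class HallucinationNet(nn.Module):
    """Predicts ("hallucinates") object-semantic features directly from
    visual features, so no explicit object detector is needed at test time."""
    def __init__(self, visual_dim=2048, object_dim=300):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(visual_dim, 1024), nn.ReLU(),
            nn.Linear(1024, object_dim),
        )

    def forward(self, visual_feat):            # (B, visual_dim)
        return self.mlp(visual_feat)           # (B, object_dim)

class CrossAttentionAugment(nn.Module):
    """Augments visual features with object semantics via cross-attention:
    visual features act as queries, object semantics as keys/values."""
    def __init__(self, visual_dim=2048, object_dim=300, hidden_dim=512):
        super().__init__()
        self.q = nn.Linear(visual_dim, hidden_dim)
        self.k = nn.Linear(object_dim, hidden_dim)
        self.v = nn.Linear(object_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, visual_dim)

    def forward(self, visual_feat, object_feat):
        # visual_feat: (B, T, visual_dim) per-segment clip features
        # object_feat: (B, N, object_dim) hallucinated object semantics
        q, k, v = self.q(visual_feat), self.k(object_feat), self.v(object_feat)
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return visual_feat + self.out(attn @ v)   # residual augmentation

# Usage: at test time the hallucinated semantics stand in for detector output.
visual = torch.randn(4, 8, 2048)                                  # 4 clips, 8 segments each
objects = HallucinationNet()(visual.mean(dim=1)).unsqueeze(1)     # (4, 1, 300)
augmented = CrossAttentionAugment()(visual, objects)              # (4, 8, 2048)

The augmented features could then be matched against word-vector embeddings of unseen class names, which is the standard ZSAR recipe the abstract builds on.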
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gao, Z., Hou, Y., Li, W., Guo, Z., Yu, B. (2023). Learning Using Privileged Information for Zero-Shot Action Recognition. In: Wang, L., Gall, J., Chin, T.-J., Sato, I., Chellappa, R. (eds.) Computer Vision – ACCV 2022. Lecture Notes in Computer Science, vol. 13844. Springer, Cham. https://doi.org/10.1007/978-3-031-26316-3_21
DOI: https://doi.org/10.1007/978-3-031-26316-3_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26315-6
Online ISBN: 978-3-031-26316-3