Abstract
Existing action recognition algorithms require a set of positive exemplars to train a classifier for each action. However, the number of action classes is very large and users’ queries vary dramatically. It is impractical to pre-define all possible action classes beforehand. To address this issue, we propose to perform action recognition with no positive exemplars, a setting often known as zero-shot learning. Current zero-shot learning paradigms usually train a series of attribute classifiers and then recognize the target actions based on the attribute representation. To ensure maximum coverage of ad-hoc action classes, attribute-based approaches require large numbers of reliable and accurate attribute classifiers, which are often unavailable in the real world. In this paper, we propose an approach that takes merely an action name as input to recognize the action of interest, without any pre-trained attribute classifiers or positive exemplars. Given an action name, we first build an analogy pool according to an external ontology; each action in the analogy pool is related to the target action at a different level. Because the correlation information inferred from the external ontology may be noisy, we then propose an algorithm, namely adaptive multi-model rank-preserving mapping (AMRM), to train a classifier for action recognition that adaptively evaluates the relatedness of each video in the analogy pool. As multiple mapping models are employed, our algorithm is better able to bridge the gap between visual features and the semantic information inferred from the ontology. Extensive experiments demonstrate that our method achieves promising performance for action recognition using only action names, with no attributes or positive exemplars available.
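The first step described above, building the analogy pool, amounts to ranking known action classes by their semantic relatedness to the target action name and keeping the most related ones. A minimal sketch of that ranking step, with hand-crafted toy term-weight vectors standing in for the ontology-derived relatedness scores (all action names, terms, and weights below are illustrative assumptions, not data from the paper):

```python
from math import sqrt

# Toy "semantic profiles" standing in for relatedness vectors derived
# from an external ontology. All names and weights are illustrative.
profiles = {
    "basketball dunk":  {"ball": 0.9, "jump": 0.8, "court": 0.7},
    "basketball shoot": {"ball": 0.9, "throw": 0.6, "court": 0.8},
    "horse riding":     {"animal": 0.9, "outdoor": 0.5},
    "volleyball spike": {"ball": 0.7, "jump": 0.7, "net": 0.8},
}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def analogy_pool(target, candidates, k=2):
    """Rank candidate actions by relatedness to the target name and
    keep the top-k as the (possibly noisy) analogy pool."""
    scored = sorted(
        ((cosine(profiles[target], profiles[c]), c) for c in candidates),
        reverse=True,
    )
    return scored[:k]

pool = analogy_pool("basketball dunk",
                    ["basketball shoot", "horse riding", "volleyball spike"])
print(pool)  # most related candidates first, each with its score
```

The resulting pool is noisy by construction, which is why the AMRM classifier downstream must weight each pooled video adaptively rather than trust the relatedness scores outright.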
Notes
Trial 5 in Table 1.
Acknowledgments
This work was partially supported by the 973 Program (No. 2012CB316400), partially supported by National Natural Science Foundation of China Grants 61033001 and 61361136003, and partially supported by the ARC DECRA (DE130101311) and the ARC DP (DP150103008). This work was done when Chuang Gan was a visiting student at Zhejiang University.
Communicated by Deva Ramanan.
Cite this article
Gan, C., Yang, Y., Zhu, L. et al. Recognizing an Action Using Its Name: A Knowledge-Based Approach. Int J Comput Vis 120, 61–77 (2016). https://doi.org/10.1007/s11263-016-0893-6