Recognizing an Action Using Its Name: A Knowledge-Based Approach

Published in: International Journal of Computer Vision

Abstract

Existing action recognition algorithms require a set of positive exemplars to train a classifier for each action. However, the number of possible action classes is very large and users' queries vary dramatically, so it is impractical to pre-define all action classes beforehand. To address this issue, we propose to perform action recognition with no positive exemplars, a setting often known as zero-shot learning. Current zero-shot learning paradigms usually train a series of attribute classifiers and then recognize the target actions based on the attribute representation. To ensure maximum coverage of ad-hoc action classes, attribute-based approaches require a large number of reliable and accurate attribute classifiers, which are often unavailable in the real world. In this paper, we propose an approach that takes only an action name as input to recognize the action of interest, without any pre-trained attribute classifiers or positive exemplars. Given an action name, we first build an analogy pool according to an external ontology; each action in the analogy pool is related to the target action to a different degree. Because the correlation information inferred from the external ontology may be noisy, we propose an algorithm, adaptive multi-model rank-preserving mapping (AMRM), to train a classifier for action recognition that adaptively evaluates the relatedness of each video in the analogy pool. As multiple mapping models are employed, our algorithm is better able to bridge the gap between visual features and the semantic information inferred from the ontology. Extensive experiments demonstrate that our method achieves promising performance for action recognition using only action names, with no attributes or positive exemplars available.
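
The abstract only sketches the pipeline (an analogy pool built from an external ontology, followed by a relatedness-aware mapping from visual features). The Python fragment below is a minimal illustrative sketch of that idea, not the authors' AMRM algorithm: it uses a single ridge-style mapping instead of multiple adaptive mapping models, and the relatedness function, feature matrices, and weighting scheme are assumptions introduced here purely for illustration.

```python
# Minimal illustrative sketch only -- NOT the paper's AMRM algorithm.
# Assumptions: `relatedness_fn`, the visual-feature matrices, and the
# relatedness-weighted ridge mapping are placeholders introduced here.

import numpy as np


def build_analogy_pool(target_name, candidate_names, relatedness_fn, top_k=5):
    """Rank candidate action classes by semantic relatedness to the target
    action name and keep the top_k most related ones as the analogy pool."""
    scored = [(name, relatedness_fn(target_name, name)) for name in candidate_names]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def fit_relatedness_weighted_mapping(pool_features, pool_relatedness, reg=1.0):
    """Fit a single linear mapping w so that w^T x approximates the semantic
    relatedness of each analogy video; more related videos get larger weight.
    (The paper employs several adaptive mapping models; a single weighted
    ridge regression is used here only to keep the sketch short.)"""
    X = np.vstack(pool_features)          # (n_videos, d) visual features
    r = np.asarray(pool_relatedness)      # (n_videos,) relatedness scores
    W = np.diag(r)                        # per-video weights from relatedness
    d = X.shape[1]
    w = np.linalg.solve(X.T @ W @ X + reg * np.eye(d), X.T @ W @ r)
    return w


def score_test_videos(w, test_features):
    """Higher score = test video judged more likely to show the target action."""
    return np.vstack(test_features) @ w
```

In a real instantiation, the relatedness function would be derived from an external ontology or a semantic-relatedness measure (for example, a word-embedding cosine similarity or a Wikipedia-based relatedness score), and the pool features would come from standard video descriptors; both choices are left open here.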


Notes

  1. Trial 5 in Table 1.

  2. http://www.nist.gov/itl/iad/mig/med11.cfm.


Acknowledgments

This work was partially supported by the 973 Program (No. 2012CB316400), partially by the National Natural Science Foundation of China under Grants 61033001 and 61361136003, and partially by the ARC DECRA (DE130101311) and the ARC DP (DP150103008). This work was done while Chuang Gan was a visiting student at Zhejiang University.

Author information

Corresponding author

Correspondence to Yi Yang.

Additional information

Communicated by Deva Ramanan.

About this article

Cite this article

Gan, C., Yang, Y., Zhu, L. et al. Recognizing an Action Using Its Name: A Knowledge-Based Approach. Int J Comput Vis 120, 61–77 (2016). https://doi.org/10.1007/s11263-016-0893-6
