Skip to main content

Weakly Supervised Action Recognition and Localization Using Web Images

  • Conference paper
  • First Online:
Computer Vision -- ACCV 2014 (ACCV 2014)

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 9007))

Included in the following conference series:

  • 1634 Accesses

Abstract

This paper addresses the problem of joint recognition and localization of actions in videos. We develop a novel Transfer Latent Support Vector Machine (TLSVM) by using Web images and weakly annotated training videos. In order to alleviate the laborious and time-consuming manual annotations of action locations, the model takes training videos which are only annotated with action labels as input. Due to the non-available ground-truth of action locations in videos, the locations are treated as latent variables in our method and are inferred during both training and testing phrases. For the purpose of improving the localization accuracy with some prior information of action locations, we collect a number of Web images which are annotated with both action labels and action locations to learn a discriminative model by enforcing the local similarities between videos and Web images. A structural transformation based on randomized clustering forest is used to map Web images to videos for handling the heterogeneous features of Web images and videos. Experiments on two publicly available action datasets demonstrate that the proposed model is effective for both action localization and action recognition.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Efros, A.A., Berg, A.C., Mori, G., Malik, J.: Recognizing action at a distance. In: IEEE International Conference on Computer Vision, pp. 726–733 (2003)

    Google Scholar 

  2. Niebles, J.C., Fei-Fei, L.: A hierarchical model of shape and appearance for human action classification. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8 (2007)

    Google Scholar 

  3. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8 (2008)

    Google Scholar 

  4. Wu, X., Xu, D., Duan, L., Luo, J., Jia, Y.: Action recognition using multilevel features and latent structural svm. IEEE Trans. Circ. Syst. Video Technol. 23, 1422–1431 (2013)

    Article  Google Scholar 

  5. Yao, A., Gall, J., Van Gool, L.: A hough transform-based voting framework for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2061–2068 (2010)

    Google Scholar 

  6. Oikonomopoulos, A., Patras, I., Pantic, M.: An implicit spatiotemporal shape model for human activity localization and recognition. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), pp. 27–33 (2009)

    Google Scholar 

  7. Lan, T., Wang, Y., Mori, G.: Discriminative figure-centric models for joint action localization and recognition. In: IEEE International Conference on Computer Vision (ICCV), pp. 2003–2010 (2011)

    Google Scholar 

  8. Raptis, M., Kokkinos, I., Soatto, S.: Discovering discriminative action parts from mid-level video representations. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1242–1249 (2012)

    Google Scholar 

  9. Shapovalova, N., Vahdat, A., Cannons, K., Lan, T., Mori, G.: Similarity constrained latent support vector machine: an application to weakly supervised action classification. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part VII. LNCS, vol. 7578, pp. 55–68. Springer, Heidelberg (2012)

    Chapter  Google Scholar 

  10. Ma, S., Zhang, J., Ikizler-Cinbis, N., Sclaroff, S.: Action recognition and localization by hierarchical space-time segments. In: IEEE International Conference on Computer Vision (2013)

    Google Scholar 

  11. Duan, L., Xu, D., Chang, S.F.: Exploiting web images for event recognition in consumer videos: a multiple source domain adaptation approach. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1338–1345 (2012)

    Google Scholar 

  12. Chen, L., Duan, L., Xu, D.: Event recognition in videos by learning from heterogeneous web sources. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2666–2673 (2013)

    Google Scholar 

  13. Ikizler-Cinbis, N., Sclaroff, S.: Web-based classifiers for human action recognition. IEEE Trans. Multimed. 14, 1031–1045 (2012)

    Article  Google Scholar 

  14. Moosmann, F., Nowak, E., Jurie, F.: Randomized clustering forests for image classification. IEEE Trans. Pattern Anal. Mach. Intell. 30, 1632–1646 (2008)

    Article  Google Scholar 

  15. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 886–893 (2005)

    Google Scholar 

  16. Wang, H., Klaser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajectories. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3169–3176 (2011)

    Google Scholar 

  17. Alexe, B., Deselaers, T., Ferrari, V.: Measuring the objectness of image windows. IEEE Trans. Pattern Anal. Mach. Intell. 34, 2189–2202 (2012)

    Article  Google Scholar 

  18. Leordeanu, M., Sukthankar, R., Sminchisescu, C.: Efficient closed-form solution to generalized boundary detection. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part IV. LNCS, vol. 7575, pp. 516–529. Springer, Heidelberg (2012)

    Chapter  Google Scholar 

  19. Liu, C.: Beyond pixels: exploring new representations and applications for motion analysis. Ph.D. thesis, Massachusetts Institute of Technology (2009)

    Google Scholar 

  20. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Sci. Am. Assoc. Adv. sci. 315, 972–976 (2007). American Association for the Advancement of Science

    MATH  MathSciNet  Google Scholar 

  21. Do, T.M.T., Artières, T.: Large margin training for hidden markov models with partially observed states. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 265–272 (2009)

    Google Scholar 

  22. Rodriguez, M., Ahmed, J., Shah, M.: Action mach: a spatio-temporal maximum average correlation height filter for action recognition. In: IEEE Conference on Computer vision and pattern recognition (CVPR), pp. 1–8 (2008)

    Google Scholar 

  23. Niebles, J.C., Chen, C.-W., Fei-Fei, L.: Modeling temporal structure of decomposable motion segments for activity classification. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part II. LNCS, vol. 6312, pp. 392–405. Springer, Heidelberg (2010)

    Chapter  Google Scholar 

  24. Tang, K., Fei-Fei, L., Koller, D.: Learning latent temporal structure for complex event detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1250–1257 (2012)

    Google Scholar 

  25. Liu, J., Kuipers, B., Savarese, S.: Recognizing human actions by attributes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3337–3344 (2011)

    Google Scholar 

  26. Li, W., Vasconcelos, N.: Recognizing activities by attribute dynamics. In: NIPS, pp. 1115–1123 (2012)

    Google Scholar 

Download references

Acknowledgement

The research was supported in part by the Natural Science Foundation of China (NSFC) under Grant 61203274, the Specialized Research Fund for the Doctoral Program of Higher Education of China (20121101120029), the Specialized Fund for Joint Building Program of Beijing Municipal Education Commission and the Excellent Young Scholars Research Fund of Beijing Institute of Technology.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Cuiwei Liu .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Liu, C., Wu, X., Jia, Y. (2015). Weakly Supervised Action Recognition and Localization Using Web Images. In: Cremers, D., Reid, I., Saito, H., Yang, MH. (eds) Computer Vision -- ACCV 2014. ACCV 2014. Lecture Notes in Computer Science(), vol 9007. Springer, Cham. https://doi.org/10.1007/978-3-319-16814-2_42

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-16814-2_42

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-16813-5

  • Online ISBN: 978-3-319-16814-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics