Real-Time Spatio-Temporal Action Localization via Learning Motion Representation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12628)

Abstract

Most state-of-the-art spatio-temporal (S-T) action localization methods explicitly use optical flow as auxiliary motion information. Although the combination of optical flow and RGB significantly improves performance, optical flow estimation incurs a large computational cost and prevents the whole network from being end-to-end trainable. These shortcomings hinder the interactive fusion of motion and RGB information and greatly limit real-world applications. In this paper, we exploit better ways to use motion information in a unified, end-to-end trainable network architecture. First, we use knowledge distillation to enable the 3D-convolutional branch to learn motion information from RGB inputs. Second, we propose a novel motion cue, the short-range-motion (SRM) module, which enhances the 2D-convolutional branch to learn both RGB and dynamic motion information. With this strategy, flow computation at test time is avoided. Finally, we apply our methods to learn powerful RGB-motion representations for action classification and localization. Experimental results show that our method significantly outperforms the state of the art on the J-HMDB-21 and UCF101-24 benchmarks, with impressive improvements of ~8% and ~3%, respectively.
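
The abstract only sketches these two components at a high level. The following PyTorch sketch illustrates one plausible reading of them: a feature-level distillation loss in the spirit of MARS-style motion distillation, where the RGB 3D-convolutional branch learns to mimic a frozen flow-stream teacher during training (so no optical flow is computed at test time), and a short-range-motion module that enriches the 2D branch with adjacent-frame feature differences. All module and function names, tensor shapes, and internal designs here are illustrative assumptions, not the authors' published implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SRM(nn.Module):
        """Hypothetical short-range-motion (SRM) module: approximates dynamic
        motion for the 2D branch by differencing feature maps of adjacent
        frames and adding the result back as a residual. The paper's actual
        SRM design is not given in the abstract; this is an assumption."""

        def __init__(self, channels: int):
            super().__init__()
            self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            # feats: (B, T, C, H, W) per-frame features from the 2D branch.
            b, t, c, h, w = feats.shape
            diff = feats[:, 1:] - feats[:, :-1]           # (B, T-1, C, H, W)
            diff = F.pad(diff, (0, 0, 0, 0, 0, 0, 0, 1))  # pad T back to length T
            motion = self.conv(diff.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
            return feats + motion                         # motion-enhanced features

    def distillation_loss(student_feat: torch.Tensor,
                          teacher_feat: torch.Tensor) -> torch.Tensor:
        # Feature-level knowledge distillation (an assumption modeled on
        # MARS-style training): the RGB 3D-CNN branch mimics the features of
        # a frozen flow-stream teacher, so flow is only needed at training time.
        return F.mse_loss(student_feat, teacher_feat.detach())

    # Minimal usage sketch with random tensors standing in for real features.
    feats = torch.randn(2, 8, 64, 28, 28)   # (B, T, C, H, W)
    enhanced = SRM(64)(feats)               # same shape as feats
    s = torch.randn(2, 512)                 # RGB 3D-branch clip feature
    t_ = torch.randn(2, 512)                # frozen flow-teacher clip feature
    loss = distillation_loss(s, t_)         # added to the task loss with a weight

In such a scheme, the distillation term would be combined with the classification and localization losses during training, and the flow teacher is discarded entirely at inference, which is consistent with the abstract's claim that flow computation at test time is avoided.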

Acknowledgments

This work was supported by the National Key Research and Development Program of China (No. 2018YFB1600600).

Author information

Correspondence to Zhigang Tu.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, Y., Tu, Z., Lin, L., Xie, X., Qin, Q. (2021). Real-Time Spatio-Temporal Action Localization via Learning Motion Representation. In: Sato, I., Han, B. (eds.) Computer Vision – ACCV 2020 Workshops. ACCV 2020. Lecture Notes in Computer Science, vol. 12628. Springer, Cham. https://doi.org/10.1007/978-3-030-69756-3_13

  • DOI: https://doi.org/10.1007/978-3-030-69756-3_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-69755-6

  • Online ISBN: 978-3-030-69756-3

  • eBook Packages: Computer Science, Computer Science (R0)
