
Spatial–temporal injection network: exploiting auxiliary losses for action recognition with apparent difference and self-attention

  • Original Paper
  • Published:
Signal, Image and Video Processing

Abstract

Two-stream convolutional networks have shown strong performance in action recognition. However, the spatial and temporal features of the two streams are learned separately, and both streams are typically subjected to the same operations with little consideration of their different characteristics. In this paper, we build upon two-stream convolutional networks and propose a novel spatial–temporal injection network (STIN) with two different auxiliary losses. To build spatial–temporal features as the video representation, an apparent difference module is designed to impose auxiliary temporal constraints on the spatial features in the spatial injection stream. A self-attention mechanism is used to attend to the regions of interest in the temporal injection stream, which reduces the influence of optical-flow noise from irrelevant regions. These auxiliary losses enable efficient training of two complementary streams that capture interactions between spatial and temporal information from different perspectives. Experiments on two well-known datasets, UCF101 and HMDB51, demonstrate the effectiveness of the proposed STIN.
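Since the full text is paywalled, the following is a minimal, hypothetical PyTorch sketch of the two mechanisms the abstract describes: an apparent-difference auxiliary loss that injects a temporal constraint into the spatial stream, and a self-attention map that down-weights noisy optical-flow regions in the temporal stream. All class names, tensor shapes, the MSE formulation of the auxiliary loss, and the 0.1 loss weight are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Simple spatial self-attention (assumed form): re-weights flow
    features so that uninteresting, noisy regions contribute less."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                                # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)     # (B, HW, C//8)
        k = self.key(x).flatten(2)                       # (B, C//8, HW)
        attn = torch.softmax(q @ k / (k.shape[1] ** 0.5), dim=-1)
        v = self.value(x).flatten(2).transpose(1, 2)     # (B, HW, C)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                   # residual injection

class TwoStreamSketch(nn.Module):
    """Hypothetical two-stream network with the two auxiliary signals."""
    def __init__(self, spatial_backbone, temporal_backbone, feat_dim, n_classes):
        super().__init__()
        self.spatial = spatial_backbone    # RGB frame -> (B, C, H, W)
        self.temporal = temporal_backbone  # stacked flow -> (B, C, H, W)
        self.attn = SelfAttention2d(feat_dim)
        self.fc_s = nn.Linear(feat_dim, n_classes)
        self.fc_t = nn.Linear(feat_dim, n_classes)

    def forward(self, rgb_t, rgb_t1, flow, labels):
        f_t = self.spatial(rgb_t)          # features of frame t
        f_t1 = self.spatial(rgb_t1)        # features of a later frame
        f_flow = self.attn(self.temporal(flow))

        # "Apparent difference" constraint (assumed form): the change
        # between spatial features of two frames should resemble the
        # attended motion features.
        loss_ad = F.mse_loss(f_t1 - f_t, f_flow.detach())

        logits_s = self.fc_s(f_t.mean(dim=(2, 3)))
        logits_t = self.fc_t(f_flow.mean(dim=(2, 3)))
        loss_cls = F.cross_entropy(logits_s, labels) + F.cross_entropy(logits_t, labels)
        return logits_s + logits_t, loss_cls + 0.1 * loss_ad  # 0.1: assumed weight
```

In this sketch, detaching the flow features inside the auxiliary loss makes the constraint flow from motion to appearance, so the spatial stream absorbs temporal structure without perturbing the temporal branch; whether the paper uses this asymmetry is an assumption.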




Acknowledgements

This work was supported by grants from the Natural Science Foundation of Shandong Province (No. ZR2020MF136), the Key Research and Development Plan of Shandong Province (No. 2019GGX101015), and the Fundamental Research Funds for the Central Universities (No. 20CX05018A).

Author information


Corresponding author

Correspondence to Chunlei Wu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Cao, H., Wu, C., Lu, J. et al. Spatial–temporal injection network: exploiting auxiliary losses for action recognition with apparent difference and self-attention. SIViP 17, 1173–1180 (2023). https://doi.org/10.1007/s11760-022-02324-x


  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11760-022-02324-x
