
Dynamic-boosting attention for self-supervised video representation learning

Applied Intelligence

Abstract

Self-supervised video representation learning leverages supervisory signals derived from the data itself to obtain scalable video representations for downstream tasks such as action recognition. Previous methods mainly exploit temporal signals to learn the temporal relationships between video frames. However, lacking semantic labels, these methods capture only weak semantic information, and interference from meaningless frames prevents them from training models sufficiently. To tackle these problems, this paper proposes a novel self-supervised video representation learning method that guides the network to learn compact, effective semantic information together with the temporal relationships of videos. Specifically, we introduce the video clip order prediction (VCOP) pretext task to learn the temporal relationships between video frames. On top of VCOP, we further propose a Dynamic-Boosting Attention (DBA) module to mine video semantic information and softly select key frames. DBA performs a dynamic boosting scheme that extracts semantic information from the high-level video feature and uses it to softly select the low-level key-frame features. We train 3D CNNs with our method and apply the learned model as a pretrained model on two downstream tasks. Experimental results demonstrate that our DBA method increases the training efficiency of self-supervised learning; notably, our 3D CNN model learns strong semantic knowledge and achieves clear improvements on downstream tasks.
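
To make the DBA idea concrete, below is a minimal PyTorch sketch of how such a module could pool a high-level video feature into a semantic vector and use it to softly weight low-level frame features. The class name, tensor shapes, and pooling/projection choices are illustrative assumptions based only on the abstract, not the authors' implementation.

```python
# Hypothetical DBA-style module reconstructed from the abstract's
# description; design details are assumptions, not the authors' code.
import torch
import torch.nn as nn


class DynamicBoostingAttention(nn.Module):
    """Pools a high-level video feature into a semantic vector and uses it
    to softly weight (i.e., softly select) low-level frame features."""

    def __init__(self, high_channels: int, low_channels: int):
        super().__init__()
        # Squeeze the high-level spatio-temporal feature into one vector.
        self.pool = nn.AdaptiveAvgPool3d(1)
        # Project the semantic vector into the low-level channel space
        # so it can score each frame (assumed design choice).
        self.proj = nn.Linear(high_channels, low_channels)

    def forward(self, low_feat: torch.Tensor, high_feat: torch.Tensor) -> torch.Tensor:
        # low_feat:  (B, C_low, T, H, W)    early-layer per-frame features
        # high_feat: (B, C_high, T', H', W') late-layer video feature
        b, _, t, _, _ = low_feat.shape
        semantic = self.proj(self.pool(high_feat).flatten(1))  # (B, C_low)
        frame_desc = low_feat.mean(dim=(3, 4))                 # (B, C_low, T)
        scores = torch.einsum('bc,bct->bt', semantic, frame_desc)
        weights = torch.softmax(scores, dim=1)                 # soft key-frame weights
        # Re-weight frames: meaningless frames receive small weights,
        # key frames are boosted, and selection stays differentiable.
        return low_feat * weights.view(b, 1, t, 1, 1)


# Example usage with dummy tensors.
dba = DynamicBoostingAttention(high_channels=512, low_channels=64)
low = torch.randn(2, 64, 16, 56, 56)   # 16 frames of early-layer features
high = torch.randn(2, 512, 2, 7, 7)    # late-layer video feature
out = dba(low, high)                   # same shape as `low`
```

The VCOP pretext task itself can be fed with tuples like the following, where the network must classify which permutation was applied to the clips. This is a generic sketch of the standard VCOP setup, not the authors' exact pipeline:

```python
import itertools
import random


def make_vcop_sample(clips):
    """Shuffle an ordered list of clips; return the shuffled clips and the
    index of the permutation used, which serves as the VCOP class label."""
    perms = list(itertools.permutations(range(len(clips))))
    label = random.randrange(len(perms))
    return [clips[i] for i in perms[label]], label
```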




Acknowledgements

The authors would like to thank the editors and the anonymous reviewers for their constructive comments and suggestions, which greatly helped in improving this article.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grants 61731003 and 62001302, in part by the Guangdong Basic and Applied Basic Research Foundation (Nos. 2021A1515011348 and 2019A1515111205), in part by the Shenzhen Science and Technology Program (Nos. JCYJ20190808145011259 and RCBS20200714114920379), and in part by the Open Project Program of the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (No. VRLAB2021C05).

Author information


Corresponding author

Correspondence to Qingyuan Yang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wang, Z., Hou, C., Yue, G. et al. Dynamic-boosting attention for self-supervised video representation learning. Appl Intell 52, 3143–3155 (2022). https://doi.org/10.1007/s10489-021-02440-0
