D2F: discriminative dense fusion of appearance and motion modalities for end-to-end video classification

  • Special Issue 1177: Advances in Deep Learning for Multimodal Fusion and Alignment
  • Published in: Multimedia Tools and Applications

Abstract

Two-stream networks with multi-modality inputs have recently proven vital for state-of-the-art video understanding. Previous deep systems typically employ a late fusion strategy; however, despite its simplicity and effectiveness, late fusion can be insufficient because it fuses the modalities only once and treats each modality equally, without discrimination. In this paper, we propose a Discriminative Dense Fusion (D2F) network that addresses these limitations by densely inserting an attention-based fusion block at each layer. We experiment with two standard action classification benchmarks and three popular classification backbones, where our proposed module consistently outperforms state-of-the-art baselines by noticeable margins. Specifically, with D2F the two-stream VGG16, ResNet, and I3D achieve accuracies of [93.5%, 69.2%], [94.6%, 70.5%], and [94.1%, 72.3%] on [UCF101, HMDB51], respectively, absolute gains of [5.5%, 9.8%], [5.13%, 9.91%], and [0.7%, 5.9%] over their late fusion counterparts. Qualitative results also demonstrate that our model learns more informative complementary representations.
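To make the fusion idea concrete, below is a minimal PyTorch sketch of what an attention-based dense fusion block of this kind could look like. The module name, the channel-gating design, and the residual feedback into both streams are illustrative assumptions inferred from the abstract, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class AttentionFusionBlock(nn.Module):
    """Hypothetical attention-based fusion block, inserted after each
    backbone stage to fuse appearance (RGB) and motion (flow) features.
    A sketch of the general technique, not the paper's code."""

    def __init__(self, channels):
        super().__init__()
        # A channel-wise gate computed from both modalities lets the
        # network weight each modality discriminatively instead of
        # treating the two streams equally.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2 * channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat, flow_feat):
        # rgb_feat, flow_feat: (N, C, H, W) features from the two streams.
        joint = torch.cat([rgb_feat, flow_feat], dim=1)  # (N, 2C, H, W)
        weights = self.gate(joint)                       # (N, 2C, 1, 1)
        fused = self.project(joint * weights)            # (N, C, H, W)
        # Feeding the fused feature back into both streams allows fusion
        # to happen densely at every layer, not just once at the end.
        return rgb_feat + fused, flow_feat + fused

# Illustrative usage with random stand-ins for one stage's features.
block = AttentionFusionBlock(channels=256)
rgb = torch.randn(2, 256, 14, 14)   # appearance features
flow = torch.randn(2, 256, 14, 14)  # motion features
rgb, flow = block(rgb, flow)
```

By contrast, a late fusion baseline would combine the two streams only once, after the final layer (for example, by averaging their class scores); dense insertion lets complementary appearance and motion cues interact throughout the feature hierarchy.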

Acknowledgements

This work was supported by the Fundamental Research Funds for the Central Universities (Nos. WK2150110007 and WK2150110012) and by the National Natural Science Foundation of China (Nos. 61772490, 61472382, 61472381, and 61572454).

Author information

Correspondence to Ammar Hawbani.

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, L., Wang, X., Hawbani, A. et al. D2F: discriminative dense fusion of appearance and motion modalities for end-to-end video classification. Multimed Tools Appl 81, 12157–12176 (2022). https://doi.org/10.1007/s11042-021-11247-7
