Abstract
Recently, two-stream networks with multi-modality inputs have proven vital for state-of-the-art video understanding. Previous deep systems typically employ a late fusion strategy; despite its simplicity and effectiveness, late fusion may be insufficient because it fuses across modalities only once and treats each modality equally, without discrimination. In this paper, we propose a Discriminative Dense Fusion (D2F) network that addresses these limitations by densely inserting an attention-based fusion block at each layer. We experiment with two typical action classification benchmarks and three popular classification backbones, where our proposed module consistently outperforms state-of-the-art baselines by noticeable margins. Specifically, with D2F the two-stream VGG16, ResNet, and I3D achieve accuracies of [93.5%, 69.2%], [94.6%, 70.5%], and [94.1%, 72.3%] on [UCF101, HMDB51], respectively, with absolute gains of [5.5%, 9.8%], [5.13%, 9.91%], and [0.7%, 5.9%] over their late fusion counterparts. Qualitative results also demonstrate that our model learns more informative complementary representations.
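The core idea of discriminative dense fusion, namely weighting the appearance and motion streams per channel before combining them at every layer, can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not the paper's method: the function names are hypothetical, and a parameter-free mean-pooled gate stands in for the learned attention block.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dense_fusion_block(appearance, motion):
    """Fuse two (C, H, W) feature maps with per-channel modality
    weights that sum to one, so the stronger modality dominates
    each channel rather than both being averaged equally."""
    # Global average pooling gives one score per channel and modality.
    scores = np.stack([appearance.mean(axis=(1, 2)),
                       motion.mean(axis=(1, 2))])      # shape (2, C)
    # Softmax across the two modalities: the "discriminative" weighting.
    w = softmax(scores, axis=0)                        # shape (2, C)
    return (w[0][:, None, None] * appearance +
            w[1][:, None, None] * motion)
```

In the dense scheme described above, such a block would be applied after every layer of the two-stream backbone, instead of a single fusion step at the end.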
Acknowledgements
This paper is supported by the Fundamental Research Funds for the Central Universities (No. WK2150110007 and WK2150110012) and by the National Natural Science Foundation of China (No. 61772490, 61472382, 61472381, and 61572454).
Cite this article
Wang, L., Wang, X., Hawbani, A. et al. D2F: discriminative dense fusion of appearance and motion modalities for end-to-end video classification. Multimed Tools Appl 81, 12157–12176 (2022). https://doi.org/10.1007/s11042-021-11247-7