Abstract
Recently, two-stream networks with multi-modality inputs have proven vital for state-of-the-art video understanding. Previous deep systems typically employ a late fusion strategy; despite its simplicity and effectiveness, late fusion may be insufficient because it fuses across modalities only once and treats each modality equally, without discrimination. In this paper, we propose a Discriminative Dense Fusion (D2F) network that addresses these limitations by densely inserting an attention-based fusion block at each layer. We experiment with two typical action classification benchmarks and three popular classification backbones, where our proposed module consistently outperforms state-of-the-art baselines by noticeable margins. Specifically, with D2F the two-stream VGG16, ResNet, and I3D achieve accuracies of [93.5%, 69.2%], [94.6%, 70.5%], and [94.1%, 72.3%] on [UCF101, HMDB51], respectively, with absolute gains of [5.5%, 9.8%], [5.13%, 9.91%], and [0.7%, 5.9%] over their late fusion counterparts. Qualitative results also demonstrate that our model learns more informative complementary representations.
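The core idea of discriminative dense fusion, namely weighting the appearance and motion streams per channel before combining them at every layer, can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not the paper's method: the function names are hypothetical, and a parameter-free mean-pooled gate stands in for the learned attention block.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dense_fusion_block(appearance, motion):
    """Fuse two (C, H, W) feature maps with per-channel modality
    weights that sum to one, so the stronger modality dominates
    each channel rather than both being averaged equally."""
    # Global average pooling gives one score per channel and modality.
    scores = np.stack([appearance.mean(axis=(1, 2)),
                       motion.mean(axis=(1, 2))])      # shape (2, C)
    # Softmax across the two modalities: the "discriminative" weighting.
    w = softmax(scores, axis=0)                        # shape (2, C)
    return (w[0][:, None, None] * appearance +
            w[1][:, None, None] * motion)
```

In the dense scheme described above, such a block would be applied after every layer of the two-stream backbone, instead of a single fusion step at the end.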
Acknowledgements
This paper is supported by the Fundamental Research Funds for the Central Universities (No. WK2150110007 and WK2150110012) and by the National Natural Science Foundation of China (No. 61772490, 61472382, 61472381, and 61572454).
Cite this article
Wang, L., Wang, X., Hawbani, A. et al. D2F: discriminative dense fusion of appearance and motion modalities for end-to-end video classification. Multimed Tools Appl 81, 12157–12176 (2022). https://doi.org/10.1007/s11042-021-11247-7