
Hybrid Deep Convolutional Neural Network with Multimodal Fusion

  • Conference paper
Data Stream Mining & Processing (DSMP 2020)

Abstract

A Hybrid Deep Convolutional Neural Network with Multimodal Fusion (HDCNNMF) topology is proposed for the multimodal recognition of speech, faces, lips, and human gesture behavior. The research is aimed at improving the understanding of complex dynamic scenes. The basic unit of the proposed hybrid system is a deep neural network topology that combines 2D and 3D convolutional neural networks (CNNs) for each modality with the proposed intermediate-level feature fusion subsystem. The feature map fusion method is based on a scaling procedure that uses pooling operations with non-square kernels and allows different types of modalities to be merged. A method for forming the audio modality feature is also proposed; it is based on the eigenvectors of the self-similarity matrix of Mel frequency cepstral coefficients (MFCC) and Mel frequency energy coefficients (MFEC) and increases the informativeness of the modality feature. A specific characteristic of the proposed fusion operation is that data of the same dimension, regardless of modality type, are fed to the input of the fusion subsystem. In the experiments, high recognition efficiency was obtained both for individual modalities and for their fusion. A distinctive feature of the proposed HDCNNMF topology is that the input set can be extended with new modality types. Such an extension should improve the quality of identification, segmentation, or recognition in complex, ambiguous visual scenes and simplify the task of affordance detection.
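The sketch below is an illustrative reconstruction (not the authors' implementation) of the two ideas named in the abstract: (a) pooling with non-square kernels to bring feature maps of different modalities to a common size before merging, and (b) building an audio feature from the eigenvectors of an MFCC self-similarity matrix. All shapes, kernel sizes, and parameter values (the 14 x 14 target grid, n_mfcc=13, n_eigvecs=8) are assumptions; MFEC features (log-mel energies) could be processed in the same way as the MFCCs shown.

```python
# Illustrative sketch only; shapes and parameters are assumed, not taken from the paper.
import numpy as np
import librosa
import torch
import torch.nn as nn

# --- (a) Intermediate-level fusion via pooling with non-square kernels ---
# Two modality feature maps of different spatial sizes (batch, channels, H, W).
audio_map = torch.randn(1, 64, 14, 98)    # hypothetical audio-stream map
visual_map = torch.randn(1, 64, 56, 56)   # hypothetical lip/face-stream map

# Pooling kernels (non-square for the audio map) scale both maps to a common
# 14 x 14 grid so they can be merged regardless of modality-specific shape.
pooled_audio = nn.MaxPool2d(kernel_size=(1, 7))(audio_map)    # -> (1, 64, 14, 14)
pooled_visual = nn.MaxPool2d(kernel_size=(4, 4))(visual_map)  # -> (1, 64, 14, 14)
fused = torch.cat([pooled_audio, pooled_visual], dim=1)       # -> (1, 128, 14, 14)


# --- (b) Audio feature from eigenvectors of an MFCC self-similarity matrix ---
def audio_feature(wav_path, n_mfcc=13, n_eigvecs=8, sr=16000):
    """Return the leading eigenvectors of the frame-wise MFCC self-similarity matrix."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)      # (n_mfcc, n_frames)

    # Cosine self-similarity between frames -> symmetric (n_frames, n_frames) matrix.
    frames = mfcc.T
    frames = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + 1e-8)
    ssm = frames @ frames.T

    # eigh returns eigenvalues in ascending order; keep the leading eigenvectors.
    _, eigvecs = np.linalg.eigh(ssm)
    return eigvecs[:, -n_eigvecs:]                               # (n_frames, n_eigvecs)
```

In a full pipeline, the per-modality feature maps would come from the 2D/3D CNN streams described in the paper, and the eigenvector matrix would be resized to whatever fixed input shape the audio stream expects.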

Acknowledgments

This research was supported by Samsung R&D Institute Ukraine, which kindly provided computing power for practical experiments. We are very grateful to Rob Cooper at BBC Research for help in obtaining the LRW dataset.

Author information

Corresponding author

Correspondence to Olena Vynokurova.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Vynokurova, O., Peleshko, D., Peleshko, M. (2020). Hybrid Deep Convolutional Neural Network with Multimodal Fusion. In: Babichev, S., Peleshko, D., Vynokurova, O. (eds) Data Stream Mining & Processing. DSMP 2020. Communications in Computer and Information Science, vol 1158. Springer, Cham. https://doi.org/10.1007/978-3-030-61656-4_4

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-61656-4_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-61655-7

  • Online ISBN: 978-3-030-61656-4

  • eBook Packages: Computer Science, Computer Science (R0)
