Abstract
The Hybrid Deep Convolutional Neural Network with Multimodal Fusion (HDCNNMF) topology is proposed for the multimodal recognition of speech, faces, lips, and human gestures. The research is aimed at improving the understanding of complex dynamic scenes. The basic unit of the proposed hybrid system is a deep neural network topology that combines 2D and 3D convolutional neural networks (CNNs) for each modality with a proposed intermediate-level feature fusion subsystem. The feature map fusion method is based on a scaling procedure built from pooling operations with non-square kernels, which allows modalities of different types to be merged. A method for forming the audio modality feature is also proposed. It is based on the eigenvectors of the self-similarity matrix of Mel frequency cepstral coefficients (MFCC) and Mel frequency energy coefficients (MFEC), and it increases the informativeness of the modality feature. A specific characteristic of the proposed fusion operation is that data of the same dimension, regardless of modality type, are fed to the input of the fusion subsystem. In the experiments, high recognition efficiency was obtained both for individual modalities and for their fusion. A distinctive feature of the proposed HDCNNMF topology is that the input set can be extended with new modality types. Such an extension of the modality set should improve the quality of identification, segmentation, or recognition in complex, ambiguous visual scenes and simplify the task of affordance detection.
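As a rough illustration of the audio-feature construction summarized above, the sketch below builds a frame-level self-similarity matrix from MFCC and log-mel-energy (MFEC-like) frames and keeps its leading eigenvectors as the modality feature. This is a minimal sketch, assuming librosa for feature extraction and cosine similarity for the self-similarity matrix; the function name, parameter values, and the similarity choice are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
import librosa

def audio_feature(wav_path, n_mfcc=13, n_mels=40, n_eig=8, sr=16000):
    """Leading eigenvectors of a frame self-similarity matrix built from
    MFCC and log-mel-energy frames. A sketch, not the paper's exact recipe."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)          # (n_mfcc, T)
    mfec = librosa.power_to_db(                                     # MFEC-like
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))  # (n_mels, T)
    frames = np.vstack([mfcc, mfec]).T                              # (T, n_mfcc + n_mels)
    unit = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + 1e-8)
    ssm = unit @ unit.T                                             # (T, T) cosine SSM
    vals, vecs = np.linalg.eigh(ssm)                                # symmetric -> eigh
    top = np.argsort(vals)[::-1][:n_eig]                            # largest eigenvalues
    return vecs[:, top]                                             # (T, n_eig) feature
```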
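The intermediate-level fusion can be pictured as a shape-normalization step: each modality's feature map is pooled down to one common spatial grid, with a non-square kernel wherever a map's height-to-width ratio differs from the target, and the rescaled maps are then concatenated. The PyTorch sketch below only demonstrates these shape mechanics under assumed map sizes, target grid, and channel-wise concatenation; none of these values come from the paper, and the sketch assumes each map's dimensions divide the target grid evenly.

```python
import torch
import torch.nn.functional as F

def fuse(feature_maps, target_hw=(8, 8)):
    """Pool each (N, C, H, W) map down to target_hw, then concatenate on channels."""
    pooled = []
    for fm in feature_maps:
        _, _, h, w = fm.shape
        # Kernel/stride become non-square when H and W differ relative to the target
        kh, kw = h // target_hw[0], w // target_hw[1]
        pooled.append(F.max_pool2d(fm, kernel_size=(kh, kw), stride=(kh, kw)))
    return torch.cat(pooled, dim=1)

# Usage: a 32x32 face map and a 16x48 lips map both become 8x8 before fusion
face = torch.randn(2, 64, 32, 32)
lips = torch.randn(2, 64, 16, 48)   # pooled with a non-square (2, 6) kernel
fused = fuse([face, lips])          # shape (2, 128, 8, 8)
```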
Acknowledgments
This research was supported by Samsung R&D Institute Ukraine, which kindly provided computing power for the practical experiments. We are very grateful to Rob Cooper at BBC Research for help in obtaining the LRW (Lip Reading in the Wild) dataset.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Vynokurova, O., Peleshko, D., Peleshko, M. (2020). Hybrid Deep Convolutional Neural Network with Multimodal Fusion. In: Babichev, S., Peleshko, D., Vynokurova, O. (eds) Data Stream Mining & Processing. DSMP 2020. Communications in Computer and Information Science, vol 1158. Springer, Cham. https://doi.org/10.1007/978-3-030-61656-4_4
DOI: https://doi.org/10.1007/978-3-030-61656-4_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61655-7
Online ISBN: 978-3-030-61656-4
eBook Packages: Computer Science (R0)