Abstract
The Hybrid Deep Convolutional Neural Network with Multimodal Fusion (HDCNNMF) topology is proposed for the multimodal recognition of speech, faces, lips, and human gestures. The research is aimed at improving the understanding of complex dynamic scenes. The basic unit of the proposed hybrid system is a deep neural network topology that combines 2D and 3D convolutional neural networks (CNNs) for each modality with a proposed intermediate-level feature fusion subsystem. The feature map fusion method is based on a scaling procedure built from pooling operations with non-square kernels, which allows modalities of different types to be merged. A method for forming the audio modality feature is also proposed. It is based on the eigenvectors of the self-similarity matrix of Mel frequency cepstral coefficients (MFCC) and Mel frequency energy coefficients (MFEC), and it increases the informativeness of the modality feature. A specific characteristic of the proposed fusion operation is that data of the same dimension, regardless of modality type, are fed to the input of the fusion subsystem. In the experiments, high recognition efficiency was obtained both for individual modalities and for their fusion. A distinctive feature of the proposed HDCNNMF topology is that the input set can be extended with new modality types. Such an extension of the modality set should improve the quality of identification, segmentation, or recognition in complex, ambiguous visual scenes and simplify the task of affordance detection.
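As a rough illustration of the audio-feature construction summarized above, the sketch below builds a frame-level self-similarity matrix from MFCC and log-mel-energy (MFEC-like) frames and keeps its leading eigenvectors as the modality feature. This is a minimal sketch, assuming librosa for feature extraction and cosine similarity for the self-similarity matrix; the function name, parameter values, and the similarity choice are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
import librosa

def audio_feature(wav_path, n_mfcc=13, n_mels=40, n_eig=8, sr=16000):
    """Leading eigenvectors of a frame self-similarity matrix built from
    MFCC and log-mel-energy frames. A sketch, not the paper's exact recipe."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)          # (n_mfcc, T)
    mfec = librosa.power_to_db(                                     # MFEC-like
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))  # (n_mels, T)
    frames = np.vstack([mfcc, mfec]).T                              # (T, n_mfcc + n_mels)
    unit = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + 1e-8)
    ssm = unit @ unit.T                                             # (T, T) cosine SSM
    vals, vecs = np.linalg.eigh(ssm)                                # symmetric -> eigh
    top = np.argsort(vals)[::-1][:n_eig]                            # largest eigenvalues
    return vecs[:, top]                                             # (T, n_eig) feature
```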
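The intermediate-level fusion can be pictured as a shape-normalization step: each modality's feature map is pooled down to one common spatial grid, with a non-square kernel wherever a map's height-to-width ratio differs from the target, and the rescaled maps are then concatenated. The PyTorch sketch below only demonstrates these shape mechanics under assumed map sizes, target grid, and channel-wise concatenation; none of these values come from the paper, and the sketch assumes each map's dimensions divide the target grid evenly.

```python
import torch
import torch.nn.functional as F

def fuse(feature_maps, target_hw=(8, 8)):
    """Pool each (N, C, H, W) map down to target_hw, then concatenate on channels."""
    pooled = []
    for fm in feature_maps:
        _, _, h, w = fm.shape
        # Kernel/stride become non-square when H and W differ relative to the target
        kh, kw = h // target_hw[0], w // target_hw[1]
        pooled.append(F.max_pool2d(fm, kernel_size=(kh, kw), stride=(kh, kw)))
    return torch.cat(pooled, dim=1)

# Usage: a 32x32 face map and a 16x48 lips map both become 8x8 before fusion
face = torch.randn(2, 64, 32, 32)
lips = torch.randn(2, 64, 16, 48)   # pooled with a non-square (2, 6) kernel
fused = fuse([face, lips])          # shape (2, 128, 8, 8)
```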
Acknowledgments
This research was supported by Samsung R&D Institute Ukraine, which kindly provided computing power for the practical experiments. We are very grateful to Rob Cooper at BBC Research for help in obtaining the LRW (Lip Reading in the Wild) dataset.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Vynokurova, O., Peleshko, D., Peleshko, M. (2020). Hybrid Deep Convolutional Neural Network with Multimodal Fusion. In: Babichev, S., Peleshko, D., Vynokurova, O. (eds) Data Stream Mining & Processing. DSMP 2020. Communications in Computer and Information Science, vol 1158. Springer, Cham. https://doi.org/10.1007/978-3-030-61656-4_4
DOI: https://doi.org/10.1007/978-3-030-61656-4_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61655-7
Online ISBN: 978-3-030-61656-4
eBook Packages: Computer Science (R0)