Abstract
Audiovisual speech recognition (AVSR) is a promising technology for noisy environments. In this work, we develop an audiovisual dataset for the Kannada language and build an AVSR system on it. The proposed system has three main components: (a) an audio module, (b) a visual speech module, and (c) integration of the audio and visual modules. In the audio module, Mel-frequency cepstral coefficients (MFCCs) are extracted as features and a one-dimensional convolutional neural network performs classification. In the visual module, Dlib is used to extract features and a long short-term memory (LSTM) recurrent neural network performs classification. Finally, the audio and visual modules are integrated using a feed-forward neural network. On the Kannada dataset, audio-only speech recognition achieves 93.86% training accuracy and 91.07% testing accuracy after 70 epochs. Visual-only speech recognition achieves 77.57% training accuracy and 75% testing accuracy. After integration, audiovisual speech recognition achieves 93.33% training accuracy and 92.26% testing accuracy.
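The integration step described above can be illustrated with a minimal sketch: the audio branch and the visual branch each produce a fixed-length embedding, and a feed-forward network classifies their concatenation. This is not the authors' implementation; the dimensions, weights, and function names below are illustrative placeholders, and the forward pass is written in plain Python for clarity.

```python
import math
import random

random.seed(0)

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def dense(v, weights, bias):
    # weights: one row of input weights per output unit
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

# Hypothetical dimensions: 8-dim audio embedding, 8-dim visual
# embedding, 5 word classes; the weights are random placeholders,
# standing in for parameters learned during training.
AUDIO_DIM, VISUAL_DIM, HIDDEN, CLASSES = 8, 8, 16, 5
W1 = [[random.uniform(-0.5, 0.5) for _ in range(AUDIO_DIM + VISUAL_DIM)]
      for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
W2 = [[random.uniform(-0.5, 0.5) for _ in range(HIDDEN)]
      for _ in range(CLASSES)]
b2 = [0.0] * CLASSES

def fuse(audio_feat, visual_feat):
    """Feed-forward late fusion: concatenate the two branch
    embeddings, pass through one hidden layer, and return
    class probabilities."""
    x = list(audio_feat) + list(visual_feat)
    h = relu(dense(x, W1, b1))
    return softmax(dense(h, W2, b2))

audio_feat = [random.random() for _ in range(AUDIO_DIM)]
visual_feat = [random.random() for _ in range(VISUAL_DIM)]
probs = fuse(audio_feat, visual_feat)
```

In practice, the audio embedding would come from the MFCC + 1D-CNN branch and the visual embedding from the Dlib + LSTM branch, with all weights trained jointly or sequentially rather than sampled at random as here.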
Funding
Not Applicable.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflicts of interest
We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.
Availability of data and material
The data that support the findings of this study are available from the corresponding author on reasonable request.
Code availability
The code is available from the corresponding author on reasonable request.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Shashidhar, R., Patilkulkarni, S. Audiovisual speech recognition for Kannada language using feed forward neural network. Neural Comput & Applic 34, 15603–15615 (2022). https://doi.org/10.1007/s00521-022-07249-7