Abstract
Audiovisual speech recognition (AVSR) is a promising technology for noisy environments. In this work, we develop an audiovisual dataset for the Kannada language and build an AVSR system on it. The proposed system has three main components: (a) an audio module, (b) a visual speech module, and (c) integration of the audio and visual modules. In the audio module, Mel-frequency cepstral coefficients (MFCCs) are extracted as features and a one-dimensional convolutional neural network performs classification. In the visual module, Dlib is used to extract features and a long short-term memory (LSTM) recurrent neural network performs classification. Finally, the audio and visual modules are integrated using a feed-forward neural network. On the Kannada dataset, audio-only speech recognition achieves 93.86% training accuracy and 91.07% testing accuracy after 70 epochs. Visual-only speech recognition achieves 77.57% training accuracy and 75% testing accuracy. After integration, audiovisual speech recognition achieves 93.33% training accuracy and 92.26% testing accuracy.
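The integration step described above can be illustrated with a minimal sketch: the audio branch and the visual branch each produce a fixed-length embedding, and a feed-forward network classifies their concatenation. This is not the authors' implementation; the dimensions, weights, and function names below are illustrative placeholders, and the forward pass is written in plain Python for clarity.

```python
import math
import random

random.seed(0)

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def dense(v, weights, bias):
    # weights: one row of input weights per output unit
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

# Hypothetical dimensions: 8-dim audio embedding, 8-dim visual
# embedding, 5 word classes; the weights are random placeholders,
# standing in for parameters learned during training.
AUDIO_DIM, VISUAL_DIM, HIDDEN, CLASSES = 8, 8, 16, 5
W1 = [[random.uniform(-0.5, 0.5) for _ in range(AUDIO_DIM + VISUAL_DIM)]
      for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
W2 = [[random.uniform(-0.5, 0.5) for _ in range(HIDDEN)]
      for _ in range(CLASSES)]
b2 = [0.0] * CLASSES

def fuse(audio_feat, visual_feat):
    """Feed-forward late fusion: concatenate the two branch
    embeddings, pass through one hidden layer, and return
    class probabilities."""
    x = list(audio_feat) + list(visual_feat)
    h = relu(dense(x, W1, b1))
    return softmax(dense(h, W2, b2))

audio_feat = [random.random() for _ in range(AUDIO_DIM)]
visual_feat = [random.random() for _ in range(VISUAL_DIM)]
probs = fuse(audio_feat, visual_feat)
```

In practice, the audio embedding would come from the MFCC + 1D-CNN branch and the visual embedding from the Dlib + LSTM branch, with all weights trained jointly or sequentially rather than sampled at random as here.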
Funding
Not Applicable.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflicts of interest
We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.
Availability of data and material
The data that support the findings of this study are available from the corresponding author on reasonable request.
Code availability
The code is available from the corresponding author on reasonable request.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Shashidhar, R., Patilkulkarni, S. Audiovisual speech recognition for Kannada language using feed forward neural network. Neural Comput & Applic 34, 15603–15615 (2022). https://doi.org/10.1007/s00521-022-07249-7