Audiovisual speech recognition for Kannada language using feed forward neural network

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

Audiovisual speech recognition (AVSR) is a promising technology for speech recognition in noisy environments. In this work, we develop an audiovisual speech database for the Kannada language and build an AVSR system on it. The proposed work comprises three main components: (a) an audio module, (b) a visual speech module, and (c) the integration of the audio and visual modules. In the audio module, Mel-frequency cepstral coefficients (MFCCs) are used to extract features, and a one-dimensional convolutional neural network is used for classification. In the visual module, Dlib is used to extract features, and a long short-term memory (LSTM) recurrent neural network is used for classification. Finally, the audio and visual modules are integrated using a feed-forward neural network. On the Kannada dataset, audio speech recognition achieves 93.86% training accuracy and 91.07% testing accuracy after seventy epochs; visual speech recognition achieves 77.57% training accuracy and 75% testing accuracy; and, after integration, audiovisual speech recognition achieves 93.33% training accuracy and 92.26% testing accuracy.
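
To make the pipeline described above concrete, here is a minimal, illustrative sketch in Python (librosa, dlib, and Keras) of the three components: MFCC features fed to a one-dimensional CNN, Dlib mouth landmarks fed to an LSTM, and a feed-forward network fusing the two streams. The layer sizes, sequence lengths, landmark subset, and number of classes below are illustrative assumptions, not the authors' published configuration.

```python
# Illustrative sketch of the three-stream AVSR architecture from the abstract.
# All dimensions marked "assumed" are our choices for illustration.
import numpy as np
import librosa
import dlib
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 10      # assumed number of Kannada word classes
N_MFCC = 13           # standard 13 MFCC coefficients
AUDIO_FRAMES = 100    # assumed fixed number of audio frames per utterance
VIDEO_FRAMES = 25     # assumed fixed number of video frames per utterance
N_LANDMARKS = 40      # 20 Dlib mouth landmarks (points 48-67), x and y

def extract_mfcc(wav_path: str) -> np.ndarray:
    """MFCC feature matrix, padded/truncated to AUDIO_FRAMES rows."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=N_MFCC).T  # (frames, 13)
    mfcc = mfcc[:AUDIO_FRAMES]
    pad = AUDIO_FRAMES - mfcc.shape[0]
    return np.pad(mfcc, ((0, pad), (0, 0)))

# Dlib's standard 68-point facial landmark model; mouth points are 48-67.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_landmarks(frame_gray: np.ndarray) -> np.ndarray:
    """Flattened (x, y) mouth landmarks for one grayscale video frame."""
    faces = detector(frame_gray)          # assumes one visible face
    shape = predictor(frame_gray, faces[0])
    pts = [(shape.part(i).x, shape.part(i).y) for i in range(48, 68)]
    return np.asarray(pts, dtype=np.float32).flatten()  # (40,)

# Audio branch: 1D convolutions over the MFCC frame sequence.
audio_in = keras.Input(shape=(AUDIO_FRAMES, N_MFCC))
a = layers.Conv1D(64, 3, activation="relu")(audio_in)
a = layers.MaxPooling1D(2)(a)
a = layers.Conv1D(128, 3, activation="relu")(a)
a = layers.GlobalAveragePooling1D()(a)

# Visual branch: LSTM over per-frame mouth-landmark vectors.
visual_in = keras.Input(shape=(VIDEO_FRAMES, N_LANDMARKS))
v = layers.LSTM(128)(visual_in)

# Fusion: concatenate both embeddings, classify with a feed-forward net.
x = layers.concatenate([a, v])
x = layers.Dense(128, activation="relu")(x)
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = keras.Model([audio_in, visual_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Note that the paper trains the audio and visual classifiers separately and then integrates them with a feed-forward network; this sketch wires the three parts into a single end-to-end model purely for brevity.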

Funding

Not applicable.

Author information

Corresponding author

Correspondence to R. Shashidhar.

Ethics declarations

Conflicts of interest

We confirm that there are no known conflicts of interest associated with this publication, and that there has been no significant financial support for this work that could have influenced its outcome.

Availability of data and material

The data that support the findings of this study are available from the corresponding author on reasonable request.

Code availability

The code is available on reasonable request from the corresponding author.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Shashidhar, R., Patilkulkarni, S. Audiovisual speech recognition for Kannada language using feed forward neural network. Neural Comput & Applic 34, 15603–15615 (2022). https://doi.org/10.1007/s00521-022-07249-7
