
Augmenting machine learning for Amharic speech recognition: a paradigm of patient’s lips motion detection

  • Published in: Multimedia Tools and Applications

Abstract

Automatic lip motion recognition is an essential input for visual speech detection. It is a technological approach that assists people who are deaf or hard of hearing and addresses the challenge of silent communication in day-to-day life. The recognition process, however, is challenging in terms of pronunciation variation, speech speed, gesture variation, skin color, makeup, the video quality of the camera, and the method of feature extraction. This paper proposes a solution for automatic lip motion recognition that identifies lip movements and characterizes their association with spoken Amharic words using the information available in those movements. The input video is converted into consecutive image frames. We use the Viola-Jones object detection algorithm to locate the face, convert the image to the YIQ color space, and apply the saturation components to detect the lip region within the face area. Sobel edge detection and morphological image operations are then applied to identify and extract the exact contour of the lip. We applied ANN and SVM classifiers to averaged shape-information features and obtained classification accuracies of 65.71% for the ANN and 66.43% for the SVM. The findings present Amharic speech recognition as a newly introduced technology to enhance the academic and linguistic skills of people with hearing problems and to support health-domain experts, physicians, and researchers. Future research directions are presented in light of these findings.
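The lip-segmentation stage described above (YIQ chrominance thresholding followed by Sobel edge extraction) can be sketched in a few lines of NumPy. This is an illustrative approximation only, not the paper's implementation: the mouth region is assumed to be given, and the `mean + k·std` threshold rule and the kernel choices are assumptions introduced here for demonstration.

```python
import numpy as np

# NTSC RGB -> YIQ conversion matrix. The Q (chrominance) row responds to
# red-magenta tones, which is why lips stand out in this color space.
RGB_TO_YIQ = np.array([[0.299,  0.587,  0.114],
                       [0.596, -0.274, -0.322],
                       [0.211, -0.523,  0.312]])

# Standard 3x3 Sobel kernel for horizontal gradients (transpose for vertical).
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

def rgb_to_yiq(rgb):
    """Convert an H x W x 3 RGB float image in [0, 1] to YIQ."""
    return rgb @ RGB_TO_YIQ.T

def convolve2d(img, kernel):
    """Minimal 'valid'-mode 2-D convolution used for the Sobel step."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def lip_mask(mouth_rgb, k=1.0):
    """Segment lip-like pixels in an (assumed) mouth region, then return
    the Sobel gradient magnitude of the binary mask, i.e. the lip contour."""
    q = rgb_to_yiq(mouth_rgb)[..., 2]            # chrominance channel
    mask = (q > q.mean() + k * q.std()).astype(float)  # assumed threshold rule
    gx = convolve2d(mask, SOBEL_X)
    gy = convolve2d(mask, SOBEL_X.T)
    return np.hypot(gx, gy)
```

In a full pipeline, the mouth region would first be cropped from a Viola-Jones face detection (e.g. OpenCV's `cv2.CascadeClassifier`), and the edge map would be cleaned with morphological closing before shape features are averaged and fed to the ANN/SVM classifiers.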




Acknowledgments

We would like to thank the anonymous reviewers for their detailed review, valuable comments, and constructive suggestions. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information


Corresponding author

Correspondence to Gebeyehu Belay Gebremeskel.

Ethics declarations

Conflict of interest

This paper, entitled "Augmenting Machine Learning for Amharic Speech Recognition: A Paradigm of Patient's Lips Motion Detection," involves no conflict of interest, and we hereby affirm that its contents are original. It has not been published elsewhere in any language, in full or in part, nor is it under review for publication elsewhere. This manuscript also received no funding.

We affirm that all authors have seen and agreed to the submitted version of the paper and to the inclusion of their names as co-authors. If the paper is accepted, we agree to comply with the terms and conditions given on the journal's website, and the journal is free to publish this contribution in the journal and on its website.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Birara, M., Gebremeskel, G.B. Augmenting machine learning for Amharic speech recognition: a paradigm of patient’s lips motion detection. Multimed Tools Appl 81, 24377–24397 (2022). https://doi.org/10.1007/s11042-022-12399-w

