Abstract
Automatic lip motion recognition is an essential input for visual speech detection. It offers a technological aid for people who are hard of hearing or deaf, and for silent communication in everyday life. The recognition task is challenging, however, owing to variation in pronunciation, speaking speed, and gesture, as well as lip color, makeup, the video quality of the camera, and the choice of feature-extraction method. This paper proposes a solution for automatic lip motion recognition that identifies lip movements and characterizes their association with spoken words in the Amharic language, using only the information available in the lip movements. The input video is converted into consecutive image frames. The Viola-Jones object detection algorithm locates the face; the frames are then converted to the YIQ color space, and the saturation components are used to detect the lip region within the face area. Sobel edge detection and morphological image operations are applied to identify and extract the exact contour of the lips. We applied ANN and SVM classifiers to averaged shape-information features and obtained classification accuracies of 65.71% and 66.43% for the ANN and SVM, respectively. Amharic speech recognition is a newly introduced technology that can enhance the academic and linguistic skills of people with hearing impairments and support health-domain experts, physicians, and researchers. Future research directions are presented in light of these findings.
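The lip-localization steps summarized above (YIQ color-space conversion followed by Sobel edge detection) can be sketched in plain NumPy. This is a minimal illustration under assumed conventions (standard NTSC YIQ matrix, RGB values in [0, 1]); the function names are hypothetical and this is not the authors' implementation:

```python
import numpy as np

def rgb_to_yiq(img):
    """Convert an RGB image (H, W, 3), values in [0, 1], to YIQ.

    Uses the standard NTSC transform; I and Q carry the chrominance
    from which a saturation-like measure can be derived.
    """
    m = np.array([[0.299,  0.587,  0.114],
                  [0.596, -0.274, -0.322],
                  [0.211, -0.523,  0.312]])
    return img @ m.T

def sobel_magnitude(gray):
    """Gradient magnitude of a 2-D grayscale image via the 3x3 Sobel kernels.

    Returns an (H-2, W-2) array (valid convolution, no padding).
    """
    kx = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])
    ky = kx.T
    h, w = gray.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    # Accumulate the cross-correlation one kernel tap at a time.
    for i in range(3):
        for j in range(3):
            patch = gray[i:i + h - 2, j:j + w - 2]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    return np.hypot(gx, gy)
```

In a pipeline like the one described, a chrominance/saturation map such as `np.hypot(I, Q)` would highlight the reddish lip region within the detected face, and `sobel_magnitude` applied to that map (followed by morphological cleanup) would trace the lip contour.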
Acknowledgments
We would like to thank the anonymous reviewers for their detailed review, valuable comments, and constructive suggestions. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Ethics declarations
Conflict of interest
The paper entitled "Augmenting Machine Learning for Amharic Speech Recognition: A Paradigm of Patient's Lips Motions Detection" has no conflict of interest, and we affirm that the contents of this technical paper are original. It has not been published elsewhere in any language, in full or in part, nor is it under review for publication elsewhere. No funding was received for this manuscript.
We affirm that all authors have seen and agreed to the submitted version of the technical paper and to the inclusion of their names as co-authors. If the paper is accepted, we agree to comply with the terms and conditions given on the journal's website, and the journal is free to publish the contribution in the journal and on its website.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Birara, M., Gebremeskel, G.B. Augmenting machine learning for Amharic speech recognition: a paradigm of patient’s lips motion detection. Multimed Tools Appl 81, 24377–24397 (2022). https://doi.org/10.1007/s11042-022-12399-w