Abstract
At present, the significance of humanoid robots has increased dramatically, yet such robots rarely enter daily human life because their development remains immature. The lip shape of a humanoid robot is crucial during speech, since it makes the robot appear more like a real human. Many studies show that vowels are the essential elements of pronunciation in all of the world's languages. Building on traditional viseme research, we raise the priority of smooth lip transitions between vowels and propose a lip-matching scheme based on vowel priority. In addition, we design a similarity evaluation model based on the Manhattan distance over computer-vision lip features, which quantifies lip-shape similarity on a 0–1 scale and provides an effective evaluation standard. Notably, this model compensates for the lack of lip-shape similarity evaluation criteria in this field. We applied the lip-matching scheme to the Ren-Xin humanoid robot and performed robot teaching experiments, as well as a similarity comparison experiment on 20 sentences spoken by two males, two females, and the robot. All experiments achieved excellent results.
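The core of the evaluation model described above can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes lip shapes are already represented as fixed-length landmark feature vectors (e.g. mouth width, opening height), and maps the Manhattan (L1) distance between two such vectors into the 0–1 range, with 1 meaning identical lip shapes:

```python
def lip_similarity(features_a, features_b):
    """Illustrative Manhattan-distance similarity for lip features.

    `features_a` and `features_b` are equal-length sequences of
    numeric lip-shape features. The L1 distance is mapped into
    (0, 1]: identical shapes score 1.0, and the score decays
    monotonically as the distance grows.
    """
    if len(features_a) != len(features_b):
        raise ValueError("feature vectors must have equal length")
    # Manhattan (L1) distance between the two feature vectors
    d = sum(abs(a - b) for a, b in zip(features_a, features_b))
    # Simple monotone mapping of distance into (0, 1]
    return 1.0 / (1.0 + d)
```

The `1 / (1 + d)` mapping is only one plausible normalization; the paper's own model may scale the distance differently, but any monotone map from L1 distance into [0, 1] yields a comparable similarity score.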
Acknowledgements
This research has been partially supported by JSPS KAKENHI Grant no. 19K20345.
Cite this article
Liu, Z., Kang, X., Nishide, S. et al. Vowel priority lip matching scheme and similarity evaluation model based on humanoid robot Ren-Xin. J Ambient Intell Human Comput 13, 5055–5066 (2022). https://doi.org/10.1007/s12652-020-02175-9