Abstract
Knowledge of the phonemes and visemes of a language is a vital component in building any speech-based application for that language. A phoneme is an atomic unit of acoustic speech that can differentiate meaning; a viseme is the equivalent atomic unit in the visual domain, describing a distinct dynamic visual speech gesture. The first phase of this paper introduces a many-to-one phoneme-to-viseme mapping for the Malayalam language based on linguistic knowledge and a data-driven approach. In the next stage, the coarticulation effect in visual speech is studied by creating a many-to-many allophone-to-viseme mapping based on the data-driven approach alone. Since the linguistic history of the visual domain is less explored for Malayalam, both mapping methods make use of the K-means data clustering algorithm. The optimum number of clusters is determined using the Gap statistic method with prior knowledge about the range of clusters. This work was carried out on a Malayalam audio-visual speech database created by the authors of this paper, consisting of 50 isolated phonemes and 106 connected words. From the 50 isolated Malayalam phonemes, 14 visemes were linguistically identified and compared with the results obtained from the data-driven approach applied to all phonemes and to consonant phonemes separately. The many-to-many mapping was studied over all allophones, vowel allophones, and consonant allophones. Geometric and DCT-based parameters are extracted and examined to find the parametric clustering of phonemes and allophones in the visual domain.
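To make the clustering methodology concrete, the following is a minimal sketch of K-means cluster-count selection via the Gap statistic (Tibshirani et al. 2001), as referenced in the abstract. This is not the authors' implementation: the function names, the uniform reference distribution over the data's bounding box, and all defaults are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm (illustrative); returns labels and
    within-cluster dispersion W_k (sum of squared distances to centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):  # guard against empty clusters
                centers[j] = pts.mean(0)
    W = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return labels, W

def gap_statistic(X, k_range, n_refs=10, seed=0):
    """Gap(k) = E[log W_k(reference)] - log W_k(data), with reference
    samples drawn uniformly over the data's bounding box.  Returns the k
    maximizing the gap (a simplification of Tibshirani's 1-SE rule)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(0), X.max(0)
    gaps = []
    for k in k_range:
        _, Wk = kmeans(X, k)
        ref_logW = [np.log(kmeans(rng.uniform(lo, hi, X.shape), k, seed=s)[1])
                    for s in range(n_refs)]
        gaps.append(np.mean(ref_logW) - np.log(Wk))
    return list(k_range)[int(np.argmax(gaps))], gaps
```

In the paper's setting, `X` would hold the geometric or DCT-based visual feature vectors per phoneme or allophone, and `k_range` the prior range of plausible viseme counts.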
Bibish Kumar, K.T., Sunil Kumar, R.K., Sandesh, E.P.A. et al. Viseme set identification from Malayalam phonemes and allophones. Int J Speech Technol 22, 1149–1166 (2019). https://doi.org/10.1007/s10772-019-09655-0