Viseme set identification from Malayalam phonemes and allophones

International Journal of Speech Technology

Abstract

Knowledge of the phonemes and visemes of a language is a vital component in building any speech-based application for that language. A phoneme is the atomic unit of acoustic speech that can differentiate meaning; a viseme is the equivalent atomic unit in the visual realm, describing a distinct dynamic visual speech gesture. The first phase of this paper introduces a many-to-one phoneme-to-viseme mapping for the Malayalam language based on linguistic knowledge and a data-driven approach. In the next stage, the coarticulation effect in visual speech is studied by creating a many-to-many allophone-to-viseme mapping based on the data-driven approach alone. Since the visual realm of Malayalam has received little linguistic study, both mapping methods make use of the K-means data clustering algorithm. The optimum number of clusters is determined using the Gap statistic method together with prior knowledge about the plausible range of clusters. The work was carried out on a Malayalam audio-visual speech database created by the authors, consisting of 50 isolated phonemes and 106 connected words. From the 50 isolated Malayalam phonemes, 14 visemes were identified linguistically and compared with the results obtained from the data-driven approach applied to all phonemes and to consonant phonemes alone. The many-to-many mapping is studied over all allophones, over vowel allophones, and over consonant allophones. Geometric and DCT-based parameters are extracted and examined to determine the parametric clustering of phonemes and allophones in the visual domain.
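The clustering step described above can be pictured with a small, self-contained sketch. The following Python snippet is not the authors' implementation; the feature matrix, cluster range, and helper names such as gap_statistic are illustrative assumptions. It applies K-means to per-phoneme visual feature vectors (for example, lip-geometry measurements or low-order DCT coefficients of the mouth region) and selects the number of viseme clusters with the Gap statistic of Tibshirani et al. (2001).

```python
# Minimal sketch (not from the paper): estimating the number of viseme
# clusters with K-means and the Gap statistic (Tibshirani et al. 2001).
# Rows of X are assumed to be per-phoneme visual feature vectors, e.g.
# lip-geometry measurements or low-order 2-D DCT coefficients of the mouth ROI.
import numpy as np
from sklearn.cluster import KMeans


def log_within_dispersion(X, k, seed=0):
    """log(W_k): within-cluster sum of squared distances after a K-means fit."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return np.log(km.inertia_)


def gap_statistic(X, k_values, n_refs=20, seed=0):
    """Return Gap(k) and its standard error s_k for each k in k_values."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, errs = [], []
    for k in k_values:
        log_wk = log_within_dispersion(X, k, seed)
        # Reference dispersions from uniform samples over the data's bounding box.
        ref = np.array([
            log_within_dispersion(rng.uniform(lo, hi, size=X.shape), k, seed)
            for _ in range(n_refs)
        ])
        gaps.append(ref.mean() - log_wk)
        errs.append(ref.std() * np.sqrt(1.0 + 1.0 / n_refs))
    return np.array(gaps), np.array(errs)


# Synthetic stand-in for the real feature matrix (150 items, 8 features each).
X = np.vstack([np.random.randn(30, 8) + 4.0 * i for i in range(5)])
ks = list(range(2, 15))   # prior knowledge about the plausible cluster range
gaps, errs = gap_statistic(X, ks)
# Tibshirani's rule: smallest k with Gap(k) >= Gap(k+1) - s_(k+1).
best_k = next((k for i, k in enumerate(ks[:-1])
               if gaps[i] >= gaps[i + 1] - errs[i + 1]), ks[-1])
print("estimated number of viseme clusters:", best_k)
```

In the setting of the paper, the same procedure would be run separately on the geometric and DCT feature sets, and on the whole-phoneme versus consonant-only subsets, before comparing the resulting clusters with the linguistically identified viseme set.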

Figs. 1–5 (Fig. 4 courtesy Tibshirani et al. 2001)

References

  • Aghaahmadi, M., Dehshibi, M. M., Bastanfard, A., & Fazlali, M. (2013). Clustering Persian viseme using phoneme subspace for developing visual speech application. Multimedia Tools and Applications,65(3), 521–541. https://doi.org/10.1007/s11042-012-1128-7.

  • Ahmad, N., Datta, S., Mulvaney, D., & Farooq, O. (2008). A comparison of visual features for audiovisual automatic speech recognition. The Journal of the Acoustical Society of America,123(5), 3939. https://doi.org/10.1121/1.2936016.

  • Alexandre, D. S., & Tavares, J. M. R. S. (2010). Introduction of human perception in visualization. International Journal of Imaging,4(10A), 60–70.

  • Alizadeh, S., Boostani, R., & Asadpour, V. (2008). Lip feature extraction and reduction for HMM-based visual speech recognition systems. In International conference on signal processing proceedings, ICSP (pp. 561–564). https://doi.org/10.1109/ICOSP.2008.4697195

  • Aschenberner, B., & Weiss, C. (2005). Phoneme-viseme mapping for German video-realistic audio-visual-speech-synthesis (pp. 1–11). Institut Für Kommunikationsforschung Und Phonetik, Universität Bonn.

  • Baswaraj, B. D., Govardhan, A., & Premchand, P. (2012). Active contours and image segmentation: The current state of the art. Global Journal of Computer Science and Technology Graphics & Vision, 12(11).

  • Bear, H. L., & Harvey, R. (2016). Decoding visemes: Improving machine lip-reading. In ICASSP 2016 (pp. 2009–2013).

  • Bear, H. L., & Harvey, R. (2018). Comparing heterogeneous visual gestures for measuring the diversity of visual speech signals. Computer Speech & Language,52, 165–190. https://doi.org/10.1016/j.csl.2018.05.001.

  • Bear, H. L., Harvey, R. W., & Lan, Y. (2017). Finding phonemes: Improving machine lip-reading (pp. 115–120). Retrieved from http://arxiv.org/abs/1710.01142

  • Binnie, C. A., Jackson, P. L., & Montgomery, A. A. (1976). Visual intelligibility of consonants: A lipreading screening test with implications for aural rehabilitation. Journal of Speech and Hearing Disorders, 41(4), 530–539.

  • Biswas, A., Sahu, P. K., Bhowmick, A., & Chandra, M. (2015). VidTIMIT audio visual phoneme recognition using AAM visual features and human auditory motivated acoustic wavelet features. In 2015 IEEE 2nd international conference on recent trends in information systems, ReTIS 2015 – Proceedings (pp. 428–433). https://doi.org/10.1109/ReTIS.2015.7232917

  • Blokland, A., & Anderson, A. H. (1998). Effect of low frame-rate video on intelligibility of speech. Speech Communication,26(1–2), 97–103. https://doi.org/10.1016/S0167-6393(98)00053-3.

  • Bozkurt, E., Erdem, Ç. E., Erzin, E., Erdem, T., & Özkan, M. (2007). Comparison of phoneme and viseme based acoustic units for speech driven realistic lip animation. In Proceedings of 3DTV-CON. https://doi.org/10.1109/3DTV.2007.4379417

  • Brahme, A., & Bhadade, U. (2017). Phoneme visem mapping for Marathi language using linguistic approach. In Proceedings – International conference on global trends in signal processing, information computing and communication, ICGTSPICC 2016 (pp. 152–157). https://doi.org/10.1109/ICGTSPICC.2016.7955288

  • Chitu, A. G., & Rothkrantz, L. J. M. (2009). Visual speech recognition automatic system for lip reading of Dutch. Information Technologies and Control, year viii(3), 2–9.

  • Damien, P., Wakim, N., & Egéa, M. (2009). Phoneme-viseme mapping for modern, classical Arabic language. In 2009 international conference on advances in computational tools for engineering applications, ACTEA 2009 (Vol. 2(1), pp. 547–552). https://doi.org/10.1109/ACTEA.2009.5227875

  • Farooq, O., Datta, S., Shrotriya, M. C., Sarikaya, R., Pellom, B. L., John, H. L., et al. (2015). Er Er. International Journal of Computer Applications,1(1), 1–4. https://doi.org/10.1109/ICASSP.2011.5947425.

  • Farooq, O., Upadhyaya, P., Varshney, P., & Upadhyaya, A. (2013). Enhancement of VSR using low dimension visual feature. https://doi.org/10.1109/MSPCT.2013.6782090

  • Fisher, C. G. (1968). Confusions among visually perceived consonants. Journal of Speech and Hearing Research,11(4), 796–804.

  • Franks, J. R., & Kimble, J. (1972). The confusion of English consonant clusters in lipreading. Journal of Speech and Hearing Research, 15(3), 474–482.

  • Gritzman, A. D., Rubin, D. M., & Pantanowitz, A. (2015). Comparison of colour transforms used in lip segmentation algorithms. Signal, Image and Video Processing,9(4), 947–957. https://doi.org/10.1007/s11760-014-0615-x.

  • Hazen, T. J., Saenko, K., La, C. H., & Glass, J. R. (2004). A segment-based audio-visual speech recognizer: Data collection, development, and initial experiments. In ICMI '04 – Sixth international conference on multimodal interfaces (pp. 235–242).

  • He, J., & Zhang, H. (2009). Research on visual speech feature extraction. In Proceedings – 2009 international conference on computer engineering and technology, ICCET 2009 (Vol. 2, pp. 499–502). https://doi.org/10.1109/ICCET.2009.63

  • Hilder, S., Theobald, B., & Harvey, R. (2010). In pursuit of visemes. In Proceedings of the international conference on auditory-visual speech processing (pp. 154–159). Retrieved from http://20.210-193-52.unknown.qala.com.sg/archive/avsp10/papers/av10_S8-2.pdf

  • Jachimski, D., Czyzewski, A., & Ciszewski, T. (2018). A comparative study of English viseme recognition methods and algorithms. Multimedia Tools and Applications, 77(13), 16495–16532.

  • Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters,31(8), 651–666. https://doi.org/10.1016/j.patrec.2009.09.011.

  • Katsaggelos, A. K., Bahaadini, S., & Molina, R. (2015). Audiovisual fusion: Challenges and new approaches. Proceedings of the IEEE,103(9), 1635–1653. https://doi.org/10.1109/JPROC.2015.2459017.

  • Lalitha, S. D., & Thyagharajan, K. K. (2016). A study on lip localization techniques used for lip reading from a video. International Journal of Applied Engineering Research,11(1), 611–615.

  • Lander, J. (1999). Read my lips: Facial animation techniques.

  • Lee, S., & Yook, D. (2002). Audio-to-visual conversion using hidden Markov models. In Lecture notes in computer science (Including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics) (Vol. 2417, pp. 563–570).

  • Li, N., Lefebvre, N., & Lengellé, R. (2014, January). Kernel hierarchical agglomerative clustering: Comparison of different gap statistics to estimate the number of clusters. In ICPRAM 2014 – Proceedings of the 3rd international conference on pattern recognition applications and methods (pp. 255–262). https://doi.org/10.5220/0004828202550262

  • Lucey, P., & Potamianos, G. (2007). Lipreading using profile versus frontal views. In 2006 IEEE 8th workshop on multimedia signal processing, MMSP 2006 (pp. 24–28). https://doi.org/10.1109/MMSP.2006.285261

  • Madhulatha, T. S. (2012). An overview on clustering methods. 2(4), 719–725. http://arxiv.org/abs/1205.1117

  • Mattheyses, W., Latacz, L., & Verhelst, W. (2013). Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis. Speech Communication,55(7–8), 857–876. https://doi.org/10.1016/j.specom.2013.02.005.

  • McLaren, M., & Lei, Y. (2015). Improved speaker recognition using DCT coefficients as features (pp. 4430–4434).

  • Meier, U., Stiefelhagen, R., Yang, J., & Waibel, A. (2000). Towards unrestricted lip reading. International Journal of Pattern Recognition and Artificial Intelligence,14(5), 571–585. https://doi.org/10.1142/S0218001400000374.

  • Melenchón, J., Simó, J., Cobo, G., Martínez, E., La, A., & Llull, U. R. (2007). Objective viseme extraction and audiovisual uncertainty: Estimation limits between auditory and visual modes.

  • Miglani, S., & Garg, K. (2013). Factors affecting efficiency of K-means algorithm. 2, 85–87.

  • Mishra, A. N., Chandra, M., Biswas, A., & Sharan, S. N. (2013). Hindi phoneme-viseme recognition from continuous speech. International Journal of Signal and Imaging Systems Engineering,6(3), 164–171. https://doi.org/10.1504/IJSISE.2013.054793.

  • Mohajer, M., Englmeier, K.-H., & Schmid, V. J. (2011). A comparison of Gap statistic definitions with and without logarithm function. Retrieved from http://arxiv.org/abs/1103.4767

  • Montgomery, A. A., & Jackson, P. L. (1983). Physical characteristics of the lips underlying vowel lipreading performance. Journal of the Acoustical Society of America,73(6), 2134–2144. https://doi.org/10.1121/1.389537.

  • Morade, S. S. (2016). Visual lip reading using 3D-DCT and 3D-DWT and LSDA. International Journal of Computer Applications,136(4), 7–15.

  • Morade, S. S., & Patnaik, S. (2014). Lip reading by using 3-D discrete wavelet transform with Dmey wavelet. International Journal of Image Processing,8, 384–396.

  • Neti, C., Potamianos, G., Luettin, J., Matthews, I., Glotin, H., Vergyri, D., et al. (2000). Audio visual speech recognition (No. REP_WORK). IDIAP.

  • Noda, K., Yamaguchi, Y., Nakadai, K., Okuno, H. G., & Ogata, T. (2015). Audio-visual speech recognition using deep learning. Applied Intelligence,42(4), 722–737. https://doi.org/10.1007/s10489-014-0629-7.

  • Puviarasan, N., & Palanivel, S. (2011). Lip reading of hearing impaired persons using HMM. Expert Systems with Applications,38(4), 4477–4481. https://doi.org/10.1016/j.eswa.2010.09.119.

  • Rajavel, R., & Sathidevi, P. S. (2009). Static and dynamic features for improved HMM based visual speech recognition. In Proceedings of the first international conference on intelligent human computer interaction (pp. 184–194). https://doi.org/10.1007/978-81-8489-203-1_17

  • Saitoh, T., & Konishi, R. (2010). A study of influence of word lip-reading by change of frame rate. Word Journal of the International Linguistic Association (pp. 400–407).

  • Sarma, M., & Sarma, K. K. (2015, May). Recent trends in intelligent and emerging systems (pp. 173–187). https://doi.org/10.1007/978-81-322-2407-5

  • Seko, T., Ukai, N., Tamura, S., & Hayamizu, S. (2013). Improvement of lipreading performance using discriminative feature and speaker adaptation. In AVSP.

  • Setyati, E., Sumpeno, S., Purnomo, M. H., Mikami, K., Kakimoto, M., & Kondo, K. (2015). Phoneme-viseme mapping for Indonesian language based on blend shape animation. IAENG International Journal of Computer Science,42(3), 1–12.

  • Stewart, D., Seymour, R., & Ming, J. (2008). Comparison of image transform-based features for visual speech recognition in clean and corrupted videos. Eurasip Journal on Image and Video Processing,2008(2008), 1–9. https://doi.org/10.1155/2008/810362.

  • Sui, C., Bennamoun, M., & Togneri, R. (2016). Visual speech feature representations: recent advances. In Advances in Face Detection and Facial Image Analysis (pp. 377–396). Cham: Springer.

  • Taylor, S. L., Mahler, M., Theobald, B. J., & Matthews, I. (2012). Dynamic units of visual speech. In Computer animation 2012 – ACM SIGGRAPH/Eurographics symposium proceedings, SCA 2012 (pp. 275–284).

  • Taylor, S., Theobald, B. J., & Matthews, I. (2015). A mouth full of words: Visually consistent acoustic redubbing. In ICASSP, IEEE international conference on acoustics, speech and signal processing – proceedings, 2015 August (pp. 4904–4908). https://doi.org/10.1109/ICASSP.2015.7178903

  • Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411–423.

  • Upadhyaya, P., Farooq, O., Abidi, M. R., & Varshney, P. (2015). Comparative study of visual feature for bimodal hindi speech recognition. Archives of Acoustics,40(4), 609–619. https://doi.org/10.1515/aoa-2015-0061.

  • Varshney, P., Farooq, O., & Upadhyaya, P. (2014). Hindi viseme recognition using subspace DCT features. International Journal of Applied Pattern Recognition,1(3), 257. https://doi.org/10.1504/ijapr.2014.065768.

  • Websdale, D., & Milner, B. (2015). Analysing the importance of different visual feature coefficients. Faavsp,3, 137–142.

  • Xiaopeng, H., Hongxun, Y., Yuqi, W., & Rong, C. (2006). A PCA based visual DCT feature extraction method for lip-reading. In Proceedings – 2006 international conference on intelligent information hiding and multimedia signal processing, IIH-MSP 2006 (pp. 321–324). https://doi.org/10.1109/IIH-MSP.2006.265008

  • Yu, D., Ghita, O., Sutherland, A., & Whelan, P. F. (2010). A novel visual speech representation and HMM classification for visual speech recognition. IPSJ Transactions on Computer Vision and Applications,2, 25–38. https://doi.org/10.2197/ipsjtcva.2.25.

Author information

Correspondence to K. T. Bibish Kumar.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Bibish Kumar, K.T., Sunil Kumar, R.K., Sandesh, E.P.A. et al. Viseme set identification from Malayalam phonemes and allophones. Int J Speech Technol 22, 1149–1166 (2019). https://doi.org/10.1007/s10772-019-09655-0
