Abstract
Knowledge of the phonemes and visemes of a language is a vital component in building any speech-based application for that language. A phoneme is an atomic unit of acoustic speech that can differentiate meaning; a viseme is the equivalent atomic unit in the visual domain, describing a distinct dynamic visual speech gesture. The first phase of this paper introduces a many-to-one phoneme-to-viseme mapping for the Malayalam language based on linguistic knowledge and a data-driven approach. In the next stage, the coarticulation effect in visual speech is studied by creating a many-to-many allophone-to-viseme mapping based on the data-driven approach alone. Since the linguistic history of the visual domain is less explored for Malayalam, both mapping methods make use of the K-means data clustering algorithm. The optimum number of clusters is determined using the Gap statistic method with prior knowledge about the range of clusters. This work was carried out on a Malayalam audio-visual speech database created by the authors of this paper, consisting of 50 isolated phonemes and 106 connected words. From the 50 isolated Malayalam phonemes, 14 visemes were linguistically identified and compared with the results obtained from the data-driven approach applied to all phonemes and to consonant phonemes separately. The many-to-many mapping was studied over all allophones, vowel allophones, and consonant allophones. Geometric and DCT-based parameters are extracted and examined to find the parametric clustering of phonemes and allophones in the visual domain.
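To make the clustering methodology concrete, the following is a minimal sketch of K-means cluster-count selection via the Gap statistic (Tibshirani et al. 2001), as referenced in the abstract. This is not the authors' implementation: the function names, the uniform reference distribution over the data's bounding box, and all defaults are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm (illustrative); returns labels and
    within-cluster dispersion W_k (sum of squared distances to centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):  # guard against empty clusters
                centers[j] = pts.mean(0)
    W = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return labels, W

def gap_statistic(X, k_range, n_refs=10, seed=0):
    """Gap(k) = E[log W_k(reference)] - log W_k(data), with reference
    samples drawn uniformly over the data's bounding box.  Returns the k
    maximizing the gap (a simplification of Tibshirani's 1-SE rule)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(0), X.max(0)
    gaps = []
    for k in k_range:
        _, Wk = kmeans(X, k)
        ref_logW = [np.log(kmeans(rng.uniform(lo, hi, X.shape), k, seed=s)[1])
                    for s in range(n_refs)]
        gaps.append(np.mean(ref_logW) - np.log(Wk))
    return list(k_range)[int(np.argmax(gaps))], gaps
```

In the paper's setting, `X` would hold the geometric or DCT-based visual feature vectors per phoneme or allophone, and `k_range` the prior range of plausible viseme counts.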
Bibish Kumar, K.T., Sunil Kumar, R.K., Sandesh, E.P.A. et al. Viseme set identification from Malayalam phonemes and allophones. Int J Speech Technol 22, 1149–1166 (2019). https://doi.org/10.1007/s10772-019-09655-0