Visual Speech Recognition Using Optical Flow and Hidden Markov Model

Sharma, Usha; Maheshkar, Sushila; Mishra, A. N.; Kaushik, Rahul

doi:10.1007/s11277-018-5930-z

Published: 10 September 2018

Volume 106, pages 2129–2147, (2019)

Wireless Personal Communications

Usha Sharma (ORCID: orcid.org/0000-0001-5481-7647)¹, Sushila Maheshkar², A. N. Mishra³ & Rahul Kaushik⁴

Abstract

The present work proposes audio-visual speech recognition using Gammatone frequency cepstral coefficient (GFCC) and optical flow (OF) features on a Hindi speech database. OF refers to the distribution of apparent velocities of brightness-pattern movement in an image. In this technique, OF is computed without extracting the location and contours of each speaker's lips. The horizontal and vertical components of the flow velocities are calculated as visual features. The visual features are then combined with the audio features using an early integration method, followed by classification with a hidden Markov model. Recognition performance on isolated Hindi digits was evaluated using GFCC features in both clean and noisy environments and compared with that of existing Mel frequency cepstral coefficient (MFCC) features. GFCC gives results almost comparable to MFCC in a clean environment; however, its performance degrades in a noisy environment. Furthermore, combining the visual features obtained by OF analysis with the GFCC audio features improves recognition performance under noisy conditions by ~12%, ~12%, and ~14% at SNRs of 5 dB, 10 dB, and 20 dB, respectively.
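
The visual front end and early integration step described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which OF algorithm, parameters, or per-frame statistics were used, so the Horn-Schunck style iteration, the values of `alpha` and `n_iter`, the choice of mean and standard deviation as features, and all function names are assumptions made here. The sketch also assumes the audio and visual streams have already been aligned to the same frame rate; the hidden Markov model classifier is omitted.

```python
import numpy as np


def local_average(f):
    """Weighted 3x3 neighbourhood average (centre excluded), edge-padded."""
    p = np.pad(f, 1, mode="edge")
    direct = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
    diagonal = p[:-2, :-2] + p[:-2, 2:] + p[2:, :-2] + p[2:, 2:]
    return direct / 6.0 + diagonal / 12.0


def horn_schunck(frame1, frame2, alpha=1.0, n_iter=100):
    """Estimate dense flow (u, v) between two grayscale frames.

    alpha (smoothness weight) and n_iter are illustrative values,
    not parameters taken from the paper.
    """
    f1 = np.asarray(frame1, dtype=np.float64)
    f2 = np.asarray(frame2, dtype=np.float64)

    # Spatial brightness derivatives averaged over both frames,
    # and the temporal derivative as a simple frame difference.
    ix = 0.5 * (np.gradient(f1, axis=1) + np.gradient(f2, axis=1))
    iy = 0.5 * (np.gradient(f1, axis=0) + np.gradient(f2, axis=0))
    it = f2 - f1

    u = np.zeros_like(f1)  # horizontal flow component
    v = np.zeros_like(f1)  # vertical flow component
    denom = alpha ** 2 + ix ** 2 + iy ** 2
    for _ in range(n_iter):
        u_avg = local_average(u)
        v_avg = local_average(v)
        common = (ix * u_avg + iy * v_avg + it) / denom
        u = u_avg - ix * common
        v = v_avg - iy * common
    return u, v


def visual_features(u, v):
    """Summarise one flow field as a 4-dimensional vector.

    Mean and standard deviation of each component are a hypothetical
    choice; the abstract only says both components are used.
    """
    return np.array([u.mean(), u.std(), v.mean(), v.std()])


def early_integration(audio_feats, visual_feats):
    """Concatenate time-aligned audio and visual frames, shape (T, Da + Dv)."""
    audio_feats = np.asarray(audio_feats)
    visual_feats = np.asarray(visual_feats)
    if audio_feats.shape[0] != visual_feats.shape[0]:
        raise ValueError("audio and visual streams must have the same number of frames")
    return np.hstack([audio_feats, visual_feats])
```

Because the flow is computed over the whole mouth region image, no lip localization or contour extraction is needed, which matches the property highlighted in the abstract. The fused frames returned by `early_integration` would then be passed to the classifier as a single observation sequence.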

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology (Indian School of Mines), Dhanbad, Jharkhand, 826004, India
Usha Sharma
Department of Computer Science and Engineering, National Institute of Technology, Delhi, 110040, India
Sushila Maheshkar
Department of Electronics and Communication Engineering, Krishna Engineering College, Ghaziabad, Uttar Pradesh, 201001, India
A. N. Mishra
Department of Electronics and Communication Engineering, Jaypee Institute of Information Technology, Noida, Uttar Pradesh, 201307, India
Rahul Kaushik

Corresponding author

Correspondence to Usha Sharma.

About this article

Cite this article

Sharma, U., Maheshkar, S., Mishra, A.N. et al. Visual Speech Recognition Using Optical Flow and Hidden Markov Model. Wireless Pers Commun 106, 2129–2147 (2019). https://doi.org/10.1007/s11277-018-5930-z

Published: 10 September 2018
Issue Date: 30 June 2019
DOI: https://doi.org/10.1007/s11277-018-5930-z
