Abstract
This paper proposes a temporal filtering technique, called visual-speech-pass filtering, for extracting visual features that improve the robustness of automatic lip-reading. A band-pass filter is applied to the pixel value sequences of the images containing the speaker’s lip region to remove unwanted variations that are not relevant to the speech information. The filter is carefully designed based on psychological, spectral, and experimental analyses. Experimental results on two speaker-independent recognition tasks and one speaker-dependent recognition task demonstrate that the proposed technique significantly improves recognition performance in both clean and visually noisy conditions.
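The core idea can be sketched in a few lines. The following is a minimal illustration only, not the paper's actual filter design: the Butterworth family, the filter order, and the cutoff frequencies are assumptions chosen for demonstration, whereas the paper derives its band edges from psychological, spectral, and experimental analyses.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def visual_speech_pass(traj, fs=30.0, low=0.5, high=10.0, order=4):
    """Band-pass filter the temporal trajectory of a lip-region pixel value.

    The cutoffs (low, high) here are illustrative placeholders, not the
    band edges designed in the paper.
    """
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return filtfilt(b, a, traj)  # zero-phase filtering along the time axis

# Toy trajectory sampled at a 30 fps frame rate: constant brightness plus a
# slow illumination drift (not speech-related), a speech-band oscillation,
# and per-frame sensor noise.
fs = 30.0
t = np.arange(0, 4, 1 / fs)
rng = np.random.default_rng(0)
traj = 128 + 20 * t + 5 * np.sin(2 * np.pi * 4 * t) + rng.normal(0, 1, t.size)
filtered = visual_speech_pass(traj, fs=fs)  # DC and slow drift largely removed
```

In practice the same filter would be applied independently to every pixel (or transform coefficient) trajectory across the frame sequence before the features are passed to the recognizer.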
Notes
This is the opposite of the acoustic speech case, where the noise interference appears in the low frequency range, and thus high-pass filtering of the temporal trajectories of filterbank energies is helpful for noise robustness [11].
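The acoustic-side counterpart mentioned in this note can be sketched as follows. This is a minimal, illustrative temporal high-pass filter applied to one (simulated) log filterbank energy channel; it is not the exact RASTA filter of Hermansky and Morgan [11], and the coefficient `alpha` is an assumption chosen only to show how slow additive drift is removed while speech-rate modulations pass through.

```python
import numpy as np

def highpass_trajectory(x, alpha=0.98):
    """First-order high-pass y[n] = x[n] - x[n-1] + alpha * y[n-1].

    Illustrative only: suppresses the DC component and slow drift of a
    feature trajectory while passing faster, speech-rate modulations.
    """
    y = np.zeros(len(x), dtype=float)
    prev_x = 0.0
    for n in range(len(x)):
        y[n] = x[n] - prev_x + (alpha * y[n - 1] if n else 0.0)
        prev_x = x[n]
    return y

# Simulated channel: speech-rate modulation (period of 8 frames) riding on a
# constant offset plus a slow channel drift.
frames = np.arange(200)
channel = 1.5 + 0.3 * np.sin(2 * np.pi * frames / 8) + 0.01 * frames
filtered = highpass_trajectory(channel)  # offset and drift suppressed
```

The visual case motivates the opposite choice: because visual noise concentrates at high temporal frequencies, the low-pass side of a band-pass filter is what provides the noise robustness there.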
References
Amer A, Dubois E (2005) Fast and reliable structure-oriented video noise estimation. IEEE Trans Circuits Syst Video Technol 15(1):113–118
Arsic I, Thiran JP (2006) Mutual information eigenlips for audio-visual speech recognition. In: Proceedings of European Signal Processing Conference, Florence, Italy
Bregler C, Konig Y (1994) Eigenlips for robust speech recognition. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing, Adelaide, Australia, vol 2, pp 669–672
Chibelushi CC, Deravi F, Mason JSD (2002) A review of speech-based bimodal recognition. IEEE Trans Multimed 4(1):23–37
Chiou GI, Hwang JN (1997) Lipreading from color video. IEEE Trans Image Process 6(8):1192–1195
Dupont S, Luettin J (2000) Audio-visual speech modeling for continuous speech recognition. IEEE Trans Multimed 2(3):141–151
Fox NA, O’Mullane BA, Reilly RB (2005) VALID: a new practical audio-visual database, and comparative results. In: Proceedings of International Conference on Audio- and Video-Based Biometric Person Authentication, New York, USA, pp 777–786
Frowein HW, Smoorenburg GF, Pyters L, Schinkel D (1991) Improved speech recognition through videotelephony: experiments with the hard of hearing. IEEE J Sel Areas Commun 9(4):611–616
Gurbuz S, Tufekci Z, Patterson E, Gowdy J (2001) Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing, Salt Lake City, UT, USA, vol 1, pp 177–180
Hennecke ME, Prasad KV, Stork DG (1995) Automatic speech recognition system using acoustic and visual signals. In: Proceedings of 29th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, vol 2, pp 1214–1218
Hermansky H, Morgan N (1994) RASTA processing of speech. IEEE Trans Speech Audio Process 2(4):578–589
Huang X, Acero A, Hon HW (2001) Spoken language processing: a guide to theory, algorithm, and system development. Prentice-Hall, Upper Saddle River
Jung HY, Lee SY (2000) On the temporal decorrelation of feature parameters for noise-robust speech recognition. IEEE Trans Speech Audio Process 8(4):407–416
Kaynak MN, Zhi Q, Cheok AD, Sengupta K, Jian Z, Chung KC (2004) Lip geometric features for human-computer interaction using bimodal speech recognition: comparison and analysis. Speech Commun 43(1–2):1–16
Lan Y, Harvey R, Theobald BJ, Ong EJ, Bowden R (2009) Comparing visual features for lipreading. In: Proceedings of International Conference on Audio-Visual Speech Processing, Norwich, UK, pp 102–106
Lan Y, Theobald BJ, Harvey R, Ong EJ, Bowden R (2010) Improving visual features for lip-reading. In: Proceedings of International Conference on Audio-Visual Speech Processing, Kanagawa, Japan, pp 142–147
Lee JS, Park CH (2008) Robust audio-visual speech recognition based on late integration. IEEE Trans Multimed 10(5):767–779
Lee JS, Park CH (2010) Hybrid simulated annealing and its application to optimization of hidden markov models for visual speech recognition. IEEE Trans Syst Man Cybern B 40(4):1188–1196
Lucey S (2003) An evaluation of visual speech features for the tasks of speech and speaker recognition. In: Proceedings of International Conference on Audio- and Video-Based Biometric Person Authentication, Guildford, UK, pp 260–267
Matthews I, Cootes TF, Bangham JA, Cox S, Harvey R (2002) Extraction of visual features for lipreading. IEEE Trans Pattern Anal Mach Intell 24(2):198–213
Matthews I, Potamianos G, Neti C, Luettin J (2001) A comparison of model and transform-based visual features for audio-visual LVCSR. In: Proceedings of International Conference on Multimedia and Expo, Tokyo, Japan, pp 22–25
Munhall K, Vatikiotis-Bateson E (1998) The moving face during speech communication. In: Campbell R, Dodd B, Burnham D (eds) Hearing by eye II: advances in the psychology of speechreading and audio-visual speech. Psychology Press, Hove, pp 123–142
Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY (2011) Multimodal deep learning. In: Proceedings of International Conference on Machine Learning, Bellevue, WA, USA
Ohala JJ (1975) The temporal regulation of speech. In: Fant G, Tatham MA (eds) Auditory analysis and perception. Academic Press, London, pp 431–453
Oppenheim AV, Schafer RW (1999) Discrete-time signal processing, 2nd edn. Prentice-Hall, Upper Saddle River
O’Shaughnessy D (2008) Automatic speech recognition: history, methods and challenges. Pattern Recognit 41:2965–2979
Petajan ED (1985) Automatic lipreading to enhance speech recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, pp 40–47
Potamianos G, Graf HP (1998) Linear discriminant analysis for speechreading. In: Proceedings of IEEE Workshop on Multimedia Processing, Redondo Beach, CA, USA, pp 221–226
Potamianos G, Graf HP, Cosatto E (1998) An image transform approach for HMM based automatic lipreading. In: Proceedings of International Conference on Image Processing, Chicago, IL, USA, vol 3, pp 173–177
Potamianos G, Neti C (2003) Audio-visual speech recognition in challenging environments. In: Proceedings of Eurospeech, Geneva, Switzerland, pp 1293–1296
Potamianos G, Neti C, Gravier G, Garg A, Senior AW (2003) Recent advances in the automatic recognition of audiovisual speech. Proc IEEE 91(9):1306–1326
Rabi G, Lu SW (1997) Energy minimization for extracting mouth curves in a facial image. In: Proceedings of International Conference on Intelligent Information Systems, Bahamas, pp 381–385
Saenko K, Darrell T, Glass J (2004) Articulatory features for robust visual speech recognition. In: Proceedings of International Conference on Multimodal Interfaces, State College, PA, USA, pp 152–158
Saenko K, Livescu K, Glass J, Darrell T (2009) Multistream articulatory feature-based models for visual speech recognition. IEEE Trans Pattern Anal Mach Intell 31:1700–1707
Seymour R, Stewart D, Ming J (2008) Comparison of image transform-based features for visual speech recognition in clean and corrupted videos. EURASIP J Image Video Process
Silsbee PL, Bovik AC (1996) Computer lipreading for improved accuracy in automatic speech recognition. IEEE Trans Speech Audio Process 4(5):337–351
Vitkovitch M, Barber P (1996) Visible speech as a function of image quality: effects of display parameters on lipreading ability. Appl Cogn Psychol 10:121–140
Zhao G, Barnard M, Pietikäinen M (2009) Lipreading with local spatiotemporal descriptors. IEEE Trans Multimed 11(7):1254–1265
Acknowledgments
This research was supported by the Ministry of Science, ICT & Future Planning (MSIP), Korea, in the ICT R&D Program 2013, and by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the MSIP (No. 2013R1A1A1007822).
Cite this article
Lee, JS. Visual-speech-pass filtering for robust automatic lip-reading. Pattern Anal Applic 17, 611–621 (2014). https://doi.org/10.1007/s10044-013-0350-x