Abstract
This paper proposes a temporal filtering technique, called visual-speech-pass filtering, for extracting visual features that improve the robustness of automatic lip-reading. A band-pass filter is applied to the pixel value sequences of the images containing the speaker’s lip region to remove unwanted variations that are not relevant to the speech information. The filter is carefully designed based on psychological, spectral, and experimental analyses. Experimental results on two speaker-independent recognition tasks and one speaker-dependent recognition task demonstrate that the proposed technique significantly improves recognition performance in both clean and visually noisy conditions.
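The core idea can be sketched in a few lines. The following is a minimal illustration only, not the paper's actual filter design: the Butterworth family, the filter order, and the cutoff frequencies are assumptions chosen for demonstration, whereas the paper derives its band edges from psychological, spectral, and experimental analyses.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def visual_speech_pass(traj, fs=30.0, low=0.5, high=10.0, order=4):
    """Band-pass filter the temporal trajectory of a lip-region pixel value.

    The cutoffs (low, high) here are illustrative placeholders, not the
    band edges designed in the paper.
    """
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return filtfilt(b, a, traj)  # zero-phase filtering along the time axis

# Toy trajectory sampled at a 30 fps frame rate: constant brightness plus a
# slow illumination drift (not speech-related), a speech-band oscillation,
# and per-frame sensor noise.
fs = 30.0
t = np.arange(0, 4, 1 / fs)
rng = np.random.default_rng(0)
traj = 128 + 20 * t + 5 * np.sin(2 * np.pi * 4 * t) + rng.normal(0, 1, t.size)
filtered = visual_speech_pass(traj, fs=fs)  # DC and slow drift largely removed
```

In practice the same filter would be applied independently to every pixel (or transform coefficient) trajectory across the frame sequence before the features are passed to the recognizer.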
Notes
This is the opposite of the acoustic speech case, where the noise interference appears in the low frequency range, and thus high-pass filtering of the temporal trajectories of filterbank energies is helpful for noise robustness [11].
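The acoustic-side counterpart mentioned in this note can be sketched as follows. This is a minimal, illustrative temporal high-pass filter applied to one (simulated) log filterbank energy channel; it is not the exact RASTA filter of Hermansky and Morgan [11], and the coefficient `alpha` is an assumption chosen only to show how slow additive drift is removed while speech-rate modulations pass through.

```python
import numpy as np

def highpass_trajectory(x, alpha=0.98):
    """First-order high-pass y[n] = x[n] - x[n-1] + alpha * y[n-1].

    Illustrative only: suppresses the DC component and slow drift of a
    feature trajectory while passing faster, speech-rate modulations.
    """
    y = np.zeros(len(x), dtype=float)
    prev_x = 0.0
    for n in range(len(x)):
        y[n] = x[n] - prev_x + (alpha * y[n - 1] if n else 0.0)
        prev_x = x[n]
    return y

# Simulated channel: speech-rate modulation (period of 8 frames) riding on a
# constant offset plus a slow channel drift.
frames = np.arange(200)
channel = 1.5 + 0.3 * np.sin(2 * np.pi * frames / 8) + 0.01 * frames
filtered = highpass_trajectory(channel)  # offset and drift suppressed
```

The visual case motivates the opposite choice: because visual noise concentrates at high temporal frequencies, the low-pass side of a band-pass filter is what provides the noise robustness there.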
References
Amer A, Dubois E (2005) Fast and reliable structure-oriented video noise estimation. IEEE Trans Circuits Syst Video Technol 15(1):113–118
Arsic I, Thiran JP (2006) Mutual information eigenlips for audio-visual speech recognition. In: Proceedings of European Signal Processing Conference, Florence, Italy
Bregler C, Konig Y (1994) Eigenlips for robust speech recognition. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing, Adelaide, Australia, vol 2, pp 669–672
Chibelushi CC, Deravi F, Mason JSD (2002) A review of speech-based bimodal recognition. IEEE Trans Multimed 4(1):23–37
Chiou GI, Hwang JN (1997) Lipreading from color video. IEEE Trans Image Process 6(8):1192–1195
Dupont S, Luettin J (2000) Audio-visual speech modeling for continuous speech recognition. IEEE Trans Multimed 2(3):141–151
Fox NA, O’Mullane BA, Reilly RB (2005) VALID: a new practical audio-visual database, and comparative results. In: Proceedings of International Conference on Audio- and Video-Based Biometric Person Authentication, New York, USA, pp 777–786
Frowein HW, Smoorenburg GF, Pyters L, Schinkel D (1991) Improved speech recognition through videotelephony: experiments with the hard of hearing. IEEE J Sel Areas Commun 9(4):611–616
Gurbuz S, Tufekci Z, Patterson E, Gowdy J (2001) Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing, Salt Lake City, UT, USA, vol 1, pp 177–180
Hennecke ME, Prasad KV, Stork DG (1995) Automatic speech recognition system using acoustic and visual signals. In: Proceedings of 29th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, vol 2, pp 1214–1218
Hermansky H, Morgan N (1994) RASTA processing of speech. IEEE Trans Speech Audio Process 2(4):578–589
Huang X, Acero A, Hon HW (2001) Spoken language processing: a guide to theory, algorithm, and system development. Prentice-Hall, Upper Saddle River
Jung HY, Lee SY (2000) On the temporal decorrelation of feature parameters for noise-robust speech recognition. IEEE Trans Speech Audio Process 8(4):407–416
Kaynak MN, Zhi Q, Cheok AD, Sengupta K, Jian Z, Chung KC (2004) Lip geometric features for human-computer interaction using bimodal speech recognition: comparison and analysis. Speech Commun 43(1–2):1–16
Lan Y, Harvey R, Theobald BJ, Ong EJ, Bowden R (2009) Comparing visual features for lipreading. In: Proceedings of International Conference on Audio-Visual Speech Processing, Norwich, UK, pp 102–106
Lan Y, Theobald BJ, Harvey R, Ong EJ, Bowden R (2010) Improving visual features for lip-reading. In: Proceedings of International Conference on Audio-Visual Speech Processing, Kanagawa, Japan, pp 142–147
Lee JS, Park CH (2008) Robust audio-visual speech recognition based on late integration. IEEE Trans Multimed 10(5):767–779
Lee JS, Park CH (2010) Hybrid simulated annealing and its application to optimization of hidden markov models for visual speech recognition. IEEE Trans Syst Man Cybern B 40(4):1188–1196
Lucey S (2003) An evaluation of visual speech features for the tasks of speech and speaker recognition. In: Proceedings of International Conference on Audio- and Video-Based Biometric Person Authentication, Guildford, UK, pp 260–267
Matthews I, Cootes TF, Bangham JA, Cox S, Harvey R (2002) Extraction of visual features for lipreading. IEEE Trans Pattern Anal Mach Intell 24(2):198–213
Matthews I, Potamianos G, Neti C, Luettin J (2001) A comparison of model and transform-based visual features for audio-visual LVCSR. In: Proceedings of International Conference on Multimedia and Expo, Tokyo, Japan, pp 22–25
Munhall K, Vatikiotis-Bateson E (1998) The moving face during speech communication. In: Campbell R, Dodd B, Burnham D (eds) Hearing by eye II: advances in the psychology of speechreading and audio-visual speech. Psychology Press, Hove, pp 123–142
Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY (2011) Multimodal deep learning. In: Proceedings of International Conference on Machine Learning, Bellevue, WA, USA
Ohala JJ (1975) The temporal regulation of speech. In: Fant G, Tatham MA (eds) Auditory analysis and perception. Academic Press, London, pp 431–453
Oppenheim AV, Schafer RW (1999) Discrete-time signal processing, 2nd edn. Prentice-Hall, Upper Saddle River
O’Shaughnessy D (2008) Automatic speech recognition: history, methods and challenges. Pattern Recognit 41:2965–2979
Petajan ED (1985) Automatic lipreading to enhance speech recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, pp 40–47
Potamianos G, Graf HP (1998) Linear discriminant analysis for speechreading. In: Proceedings of IEEE Workshop on Multimedia Processing, Redondo Beach, CA, USA, pp 221–226
Potamianos G, Graf HP, Cosatto E (1998) An image transform approach for HMM based automatic lipreading. In: Proceedings of International Conference on Image Processing, Chicago, IL, USA, vol 3, pp 173–177
Potamianos G, Neti C (2003) Audio-visual speech recognition in challenging environments. In: Proceedings of Eurospeech, Geneva, Switzerland, pp 1293–1296
Potamianos G, Neti C, Gravier G, Garg A, Senior AW (2003) Recent advances in the automatic recognition of audiovisual speech. Proc IEEE 91(9):1306–1326
Rabi G, Lu SW (1997) Energy minimization for extracting mouth curves in a facial image. In: Proceedings of International Conference on Intelligent Information Systems, Bahamas, pp 381–385
Saenko K, Darrell T, Glass J (2004) Articulatory features for robust visual speech recognition. In: Proceedings of International Conference on Multimodal Interfaces, State College, PA, USA, pp 152–158
Saenko K, Livescu K, Glass J, Darrell T (2009) Multistream articulatory feature-based models for visual speech recognition. IEEE Trans Pattern Anal Mach Intell 31:1700–1707
Seymour R, Stewart D, Ming J (2008) Comparison of image transform-based features for visual speech recognition in clean and corrupted videos. EURASIP J Image Video Process
Silsbee PL, Bovik AC (1996) Computer lipreading for improved accuracy in automatic speech recognition. IEEE Trans Speech Audio Process 4(5):337–351
Vitkovitch M, Barber P (1996) Visible speech as a function of image quality: effects of display parameters on lipreading ability. Appl Cogn Psychol 10:121–140
Zhao G, Barnard M, Pietikäinen M (2009) Lipreading with local spatiotemporal descriptors. IEEE Trans Multimed 11(7):1254–1265
Acknowledgments
This research was supported by the Ministry of Science, ICT & Future Planning (MSIP), Korea, in the ICT R&D Program 2013, and by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the MSIP (No. 2013R1A1A1007822).
Cite this article
Lee, JS. Visual-speech-pass filtering for robust automatic lip-reading. Pattern Anal Applic 17, 611–621 (2014). https://doi.org/10.1007/s10044-013-0350-x