DBN Based Models for Audio-Visual Speech Analysis and Recognition

Ravyse, Ilse; Jiang, Dongmei; Jiang, Xiaoyue; Lv, Guoyun; Hou, Yunshu; Sahli, Hichem; Zhao, Rongchun

doi:10.1007/11922162_3


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4261)

Included in the following conference series:

Pacific-Rim Conference on Multimedia


Abstract

We present an audio-visual automatic speech recognition system that significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system consists of three components: (i) a visual module, (ii) an acoustic module, and (iii) a Dynamic Bayesian Network (DBN)-based recognition module. The visual module locates and tracks the speaker's head and mouth movements, and extracts relevant speech features represented by contour information and 3D deformations of lip movements. The acoustic module extracts noise-robust features, namely the Mel Filterbank Cepstrum Coefficients (MFCCs). Finally, we propose two DBN-based models: one that considers the audio and video streams individually, and one that integrates the features from both streams. We also compare the proposed DBN-based system with a classical Hidden Markov Model. The novelty of the developed framework is that the characteristics of the audio-visual speech signal are preserved from the extraction step through the learning step. Experiments on continuous audio-visual speech show that the segmentation boundaries of phones in the audio stream and of visemes in the video stream are close to the manual segmentation boundaries.
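
The abstract does not state how closeness to the manual boundaries is measured. As a minimal sketch, assuming boundaries are given as paired time stamps in milliseconds, one plausible way to quantify the agreement is the mean absolute deviation and the fraction of boundaries within a tolerance. The function names, the millisecond unit, and the tolerance parameter are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def mean_boundary_deviation(auto_ms: Sequence[float],
                            manual_ms: Sequence[float]) -> float:
    """Mean absolute difference between paired boundaries.

    Assumption: both lists hold the same boundaries in the same order,
    expressed in milliseconds (hypothetical convention for this sketch).
    """
    if len(auto_ms) != len(manual_ms):
        raise ValueError("boundary lists must have the same length")
    if not auto_ms:
        raise ValueError("at least one boundary is required")
    total = sum(abs(a - m) for a, m in zip(auto_ms, manual_ms))
    return total / len(auto_ms)


def fraction_within(auto_ms: Sequence[float],
                    manual_ms: Sequence[float],
                    tolerance_ms: float) -> float:
    """Fraction of automatic boundaries within tolerance_ms of the manual ones."""
    if len(auto_ms) != len(manual_ms):
        raise ValueError("boundary lists must have the same length")
    if not auto_ms:
        raise ValueError("at least one boundary is required")
    hits = sum(1 for a, m in zip(auto_ms, manual_ms)
               if abs(a - m) <= tolerance_ms)
    return hits / len(auto_ms)


if __name__ == "__main__":
    # Made-up example values, not results from the paper.
    automatic = [100, 250, 400]
    manual = [110, 240, 400]
    print(mean_boundary_deviation(automatic, manual))
    print(fraction_within(automatic, manual, tolerance_ms=5))
```

The same two measures could be applied separately to phone boundaries in the audio stream and to viseme boundaries in the video stream.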


Author information

Authors and Affiliations

Department ETRO, Joint Research Group on Audio Visual Signal Processing (AVSP), Vrije Universiteit Brussel, Pleinlaan 2, 1050, Brussel
Ilse Ravyse & Hichem Sahli
IMEC, Kapeldreef 75, 3001, Leuven
Hichem Sahli
School of Computer Science, Northwestern Polytechnical University, 127 Youyi Xilu, Xi’an, 710072, P.R. China
Dongmei Jiang, Xiaoyue Jiang, Guoyun Lv, Yunshu Hou & Rongchun Zhao

Editor information

Editors and Affiliations

College of Computer Science, Zhejiang University, China
Yueting Zhuang
Department of Computer Science and Technology, Tsinghua University, P.R. China
Shi-Qiang Yang
Microsoft Corporation, Microsoft China R&D Group, 49 Zhichun Road, 100080, Beijing, China
Yong Rui
College of Computer Science and Technology, Zhejiang University, 310027, Hangzhou, Zhejiang Province, China
Qinming He

About this paper

Cite this paper

Ravyse, I. et al. (2006). DBN Based Models for Audio-Visual Speech Analysis and Recognition. In: Zhuang, Y., Yang, SQ., Rui, Y., He, Q. (eds) Advances in Multimedia Information Processing - PCM 2006. PCM 2006. Lecture Notes in Computer Science, vol 4261. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11922162_3

DOI: https://doi.org/10.1007/11922162_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-48766-1
Online ISBN: 978-3-540-48769-2
eBook Packages: Computer Science, Computer Science (R0)
