Some Experiments in Audio-Visual Speech Processing

Chollet, G.; Landais, R.; Hueber, T.; Bredin, H.; Mokbel, C.; Perrot, P.; Zouari, L.

doi:10.1007/978-3-540-77347-4_2

G. Chollet¹,
R. Landais¹,
T. Hueber¹˒²,
H. Bredin¹,
C. Mokbel⁴,
P. Perrot¹˒³ &
L. Zouari¹

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4885)

Included in the following conference series:

International Conference on Nonlinear Speech Processing


Abstract

Natural speech is produced by the vocal organs of a particular talker. The acoustic features of the speech signal must therefore be correlated with the movements of the articulators (lips, jaw, tongue, velum, etc.). For instance, lip reading improves speech understanding for hearing-impaired people, and for normal-hearing listeners as well. This chapter gives an overview of audiovisual speech processing, with emphasis on experiments in speech recognition, speaker verification, indexing, and corpus-based synthesis from tongue and lip movements.
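The premise that acoustic features are correlated with articulator movements can be made concrete with a very simple measurement. The sketch below is hypothetical and is not the method used in the chapter: it assumes one scalar acoustic feature per video frame (for example frame energy) and one scalar visual feature per frame (for example lip opening), sampled at the same rate, and searches a small range of frame offsets for the highest Pearson correlation. The function names `pearson` and `best_lag` are illustrative only.

```python
import math
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant stream
    return cov / math.sqrt(vx * vy)


def best_lag(audio: Sequence[float], visual: Sequence[float],
             max_lag: int) -> Tuple[int, float]:
    """Hypothetical helper: find the frame lag in [-max_lag, max_lag]
    that maximises audio-visual correlation.

    A positive lag means the audio trails the visual stream
    (audio[t + lag] is paired with visual[t]); a negative lag means
    the audio leads.
    """
    if len(audio) != len(visual):
        raise ValueError("streams must have the same frame rate and length")
    n = len(audio)
    best = (0, float("-inf"))
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, v = audio[lag:], visual[:n - lag]
        else:
            a, v = audio[:n + lag], visual[-lag:]
        if len(a) < 2:
            continue  # too little overlap to estimate a correlation
        r = pearson(a, v)
        if r > best[1]:
            best = (lag, r)
    return best


if __name__ == "__main__":
    visual = [0, 1, 3, 2, 5, 4, 1, 0, 2, 3]  # toy lip-opening track
    audio = [0, 0] + visual[:8]              # same contour, delayed by 2 frames
    print(best_lag(audio, visual, max_lag=3))  # prints (2, 1.0)
```

The toy streams are constructed so that the audio is an exact delayed copy of the visual track; on real recordings the peak correlation would be far lower, and the lag search only encodes the assumption of this sketch that the two streams may be offset by a few frames.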

Author information

Authors and Affiliations

CNRS LTCI/TSI Paris, 46 rue Barrault, 75634 Paris Cedex 13, France
G. Chollet, R. Landais, T. Hueber, H. Bredin, P. Perrot & L. Zouari
Laboratoire d’Electronique, ESPCI, 10 rue Vauquelin, 75005 Paris, France
T. Hueber
Institut de Recherche Criminelle de la Gendarmerie Nationale (IRCGN), 93110 Rosny-sous-Bois, France
P. Perrot
University of Balamand, PO Box 100, Tripoli, Lebanon
C. Mokbel

Editor information

Mohamed Chetouani, Amir Hussain, Bruno Gas, Maurice Milgram, Jean-Luc Zarader

About this paper

Cite this paper

Chollet, G. et al. (2007). Some Experiments in Audio-Visual Speech Processing. In: Chetouani, M., Hussain, A., Gas, B., Milgram, M., Zarader, JL. (eds) Advances in Nonlinear Speech Processing. NOLISP 2007. Lecture Notes in Computer Science, vol 4885. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77347-4_2

DOI: https://doi.org/10.1007/978-3-540-77347-4_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77346-7
Online ISBN: 978-3-540-77347-4
eBook Packages: Computer Science, Computer Science (R0)