Abstract
Natural speech is produced by the vocal organs of a particular talker. The acoustic features of the speech signal must therefore be correlated with the movements of the articulators (lips, jaw, tongue, velum, etc.). For instance, hearing-impaired people (and not only they) improve their understanding of speech by lip reading. This chapter gives an overview of audiovisual speech processing, with emphasis on some experiments concerning recognition, speaker verification, indexing, and corpus-based synthesis from tongue and lip movements.
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chollet, G. et al. (2007). Some Experiments in Audio-Visual Speech Processing. In: Chetouani, M., Hussain, A., Gas, B., Milgram, M., Zarader, JL. (eds) Advances in Nonlinear Speech Processing. NOLISP 2007. Lecture Notes in Computer Science, vol 4885. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77347-4_2
DOI: https://doi.org/10.1007/978-3-540-77347-4_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77346-7
Online ISBN: 978-3-540-77347-4
eBook Packages: Computer Science, Computer Science (R0)