Some Experiments in Audio-Visual Speech Processing

  • Conference paper
Advances in Nonlinear Speech Processing (NOLISP 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4885)

Abstract

Natural speech is produced by the vocal organs of a particular talker. The acoustic features of the speech signal must therefore be correlated with the movements of the articulators (lips, jaw, tongue, velum, ...). For instance, hearing-impaired people (and not only them) improve their understanding of speech by lip reading. This chapter is an overview of audio-visual speech processing, with emphasis on some experiments concerning recognition, speaker verification, indexing, and corpus-based synthesis from tongue and lip movements.

Editor information

Mohamed Chetouani, Amir Hussain, Bruno Gas, Maurice Milgram, Jean-Luc Zarader

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chollet, G. et al. (2007). Some Experiments in Audio-Visual Speech Processing. In: Chetouani, M., Hussain, A., Gas, B., Milgram, M., Zarader, JL. (eds) Advances in Nonlinear Speech Processing. NOLISP 2007. Lecture Notes in Computer Science, vol 4885. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77347-4_2

  • DOI: https://doi.org/10.1007/978-3-540-77347-4_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-77346-7

  • Online ISBN: 978-3-540-77347-4

  • eBook Packages: Computer Science, Computer Science (R0)
