Extraction of Features for Lip-reading Using Autoencoders

  • Conference paper
Speech and Computer (SPECOM 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8773)

Abstract

We study the incorporation of facial depth data in the task of isolated word visual speech recognition. We propose novel features based on unsupervised training of a single-layer autoencoder. The features are extracted from both the video and depth channels obtained by a Microsoft Kinect device. We perform all experiments on our database of 54 speakers, each uttering 50 words. We compare our autoencoder features to traditional methods such as DCT or PCA. The features are further processed by a simplified variant of hierarchical linear discriminant analysis in order to capture the speech dynamics. The classification is performed using a multi-stream Hidden Markov Model for various combinations of audio, video, and depth channels. We also evaluate visual features in joint audio-video isolated word recognition in noisy environments.
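
To make the idea of autoencoder-based features concrete, the following is a minimal sketch of a single-hidden-layer autoencoder trained without labels, whose hidden activations are then used as a feature vector. It is not the configuration used in the paper: the abstract does not specify the nonlinearity, loss, regularization, or optimizer, so the sigmoid hidden layer, linear decoder, squared-error loss, and plain batch gradient descent below are all assumptions, as are the function names `train_autoencoder` and `encode` and the synthetic input data.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_autoencoder(X, n_hidden, n_epochs=100, lr=0.05, seed=0):
    """Train a single-hidden-layer autoencoder by batch gradient descent.

    X is an (n_samples, n_inputs) array, e.g. flattened image patches.
    Returns the learned parameters and the mean squared reconstruction
    error recorded at every epoch.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_inputs = X.shape
    W1 = rng.normal(0.0, 0.1, size=(n_inputs, n_hidden))  # encoder weights
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, size=(n_hidden, n_inputs))  # decoder weights
    b2 = np.zeros(n_inputs)

    losses = []
    for _ in range(n_epochs):
        # Forward pass: sigmoid code, linear reconstruction (assumed design).
        H = sigmoid(X @ W1 + b1)
        X_hat = H @ W2 + b2
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))

        # Backward pass for the objective 0.5 * sum(err**2) / n_samples.
        d_out = err / n_samples
        dW2 = H.T @ d_out
        db2 = d_out.sum(axis=0)
        dZ = (d_out @ W2.T) * H * (1.0 - H)  # sigmoid derivative
        dW1 = X.T @ dZ
        db1 = dZ.sum(axis=0)

        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    return (W1, b1, W2, b2), losses


def encode(X, params):
    """Map inputs to hidden activations, used here as the feature vector."""
    W1, b1, _, _ = params
    return sigmoid(X @ W1 + b1)


# Synthetic stand-in for flattened, intensity-normalized image patches.
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.random((100, 20))
demo_params, demo_losses = train_autoencoder(X_demo, n_hidden=5)
demo_features = encode(X_demo, demo_params)  # one 5-dimensional vector per patch
```

In a setup like the one the abstract describes, the same encoder could be fit separately on video and depth patches, with each frame's encoded vector passed on to later processing stages.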

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Paleček, K. (2014). Extraction of Features for Lip-reading Using Autoencoders. In: Ronzhin, A., Potapova, R., Delic, V. (eds) Speech and Computer. SPECOM 2014. Lecture Notes in Computer Science, vol 8773. Springer, Cham. https://doi.org/10.1007/978-3-319-11581-8_26

  • DOI: https://doi.org/10.1007/978-3-319-11581-8_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11580-1

  • Online ISBN: 978-3-319-11581-8

  • eBook Packages: Computer Science, Computer Science (R0)
