Bimodality Streams Integration for Audio-Visual Speech Recognition Systems

  • Conference paper
Hybrid Intelligent Systems (HIS 2016)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 420)

Abstract

This paper presents the state of the art of the ‘whole-word-state Dynamic Bayesian Network (DBN)’ model for audio and visual integration. Many DBN models have been proposed for speech recognition in recent years because of their strong descriptive ability and flexible structure. A DBN is a statistical model that can represent multiple collections of random variables as they evolve over time. However, a DBN model with a whole-word-state structure does not allow speech to be segmented into subunits. In this study, a single-stream DBN (SDBN) model is proposed, and speech recognition and segmentation experiments are carried out on audio and visual speech respectively. To evaluate the performance of the proposed model, the timing boundaries of the segmented syllables are compared with those obtained from well-trained tri-phone Hidden Markov Models (HMMs). Besides the word recognition results, syllable recognition rates and segmentation outputs are also obtained from the audio and visual speech feature streams. Experimental results show that, in a clean environment, the SDBN model with the perceptual linear prediction (PLP) feature stream achieves a word recognition rate of 98.50 %, higher than that of the tri-phone HMM model. Moreover, as noise in the audio stream increases, the SDBN model gives more robust results.

Acknowledgements

The authors gratefully acknowledge the Research Management Centre (RMC), Universiti Teknologi MARA, for the funding received through grant RAGS/1/2014/ICT07/UiTM//3.

Author information

Corresponding author

Correspondence to Noraini Seman.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Seman, N., Roslan, R., Jamil, N., Ardi, N. (2016). Bimodality Streams Integration for Audio-Visual Speech Recognition Systems. In: Abraham, A., Han, S., Al-Sharhan, S., Liu, H. (eds) Hybrid Intelligent Systems. HIS 2016. Advances in Intelligent Systems and Computing, vol 420. Springer, Cham. https://doi.org/10.1007/978-3-319-27221-4_11

  • DOI: https://doi.org/10.1007/978-3-319-27221-4_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27220-7

  • Online ISBN: 978-3-319-27221-4

  • eBook Packages: Engineering, Engineering (R0)