
Abstract

This paper discusses robust speech section detection using audio and video modalities. Most of today's speech recognition systems require speech section detection prior to any further analysis, and the accuracy of the detected speech sections is known to affect speech recognition accuracy. Because audio modalities are intrinsically disturbed by acoustic noise, we have been researching video-modality speech section detection that detects deformations in images of the speech organs. Video modalities are robust to audio noise, but the sections they detect are longer than the audio speech sections, because deformations of the related organs start before the speech in preparation for articulating the first phoneme, and because the settling-down motion lasts beyond the end of the speech. We have verified that the inaccurate sections caused by this excess length degrade the speech recognition rate, producing insertion errors. To reduce insertion errors and enhance the robustness of speech detection, we propose a method that takes advantage of both modalities. Our experiments confirm that the proposed method reduces the insertion error rate and increases the recognition rate in noisy environments.
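As an illustration of the general idea only (not the authors' implementation), the sketch below shows one way a video-detected section could be trimmed with a simple short-time audio-energy criterion: the video modality supplies a coarse, noise-robust window, and the audio modality refines its endpoints inside that window. The function name, frame/hop sizes, and energy threshold are assumptions introduced here for clarity.

    import numpy as np

    def trim_video_section(audio, sr, video_start_s, video_end_s,
                           frame_ms=25, hop_ms=10, threshold_db=-35.0):
        """Refine a (typically too long) video-detected speech section
        using short-time audio energy inside that section.

        audio         : 1-D array of samples, assumed normalized to [-1, 1]
        sr            : sampling rate in Hz
        video_start_s : section start from the video modality (seconds)
        video_end_s   : section end from the video modality (seconds)
        Returns (start_s, end_s) of the refined section, or None if no
        frame inside the video section exceeds the energy threshold.
        """
        audio = np.asarray(audio, dtype=np.float64)
        frame = int(sr * frame_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        lo = max(int(video_start_s * sr), 0)
        hi = min(int(video_end_s * sr), len(audio))

        active = []
        for start in range(lo, hi - frame, hop):
            seg = audio[start:start + frame]
            # Log energy of the frame; the small constant avoids log(0).
            energy_db = 10.0 * np.log10(np.mean(seg ** 2) + 1e-12)
            if energy_db > threshold_db:
                active.append(start)

        if not active:
            return None  # audio found no speech inside the video section
        return active[0] / sr, (active[-1] + frame) / sr

Restricting the audio search to the video-detected window is what such a combination buys: noise outside the window cannot trigger spurious endpoints, while the audio criterion removes the pre-articulation and settling-down excess that the video section carries.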





Cite this article

Murai, K., Nakamura, S. A Robust Bimodal Speech Section Detection. The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology 36, 81–90 (2004). https://doi.org/10.1023/B:VLSI.0000015088.91532.7a
