
Acoustic Feature Analysis and Discriminative Modeling of Filled Pauses for Spontaneous Speech Recognition

Journal of VLSI signal processing systems for signal, image and video technology

Abstract

Most automatic speech recognizers (ASRs) concentrate on read speech, which differs from spontaneous speech in that it contains few disfluencies. ASRs have difficulty with speech that has a high rate of disfluencies such as filled pauses, repetitions, lengthening, repairs, false starts and silent pauses. In this paper, we focus on the feature analysis and modeling of the filled pauses “ah,” “ung,” “um,” “em,” and “hem” in spontaneous speech. The Karhunen-Loève transform (KLT) and linear discriminant analysis (LDA) were adopted to select discriminant features for filled pause detection. Bartlett hypothesis testing was used to determine a suitable number of discriminant features, and twenty-six features were selected. Gaussian mixture models (GMMs), trained with a gradient descent algorithm, were used to improve the filled pause detection performance. The experimental results show that the filled pause detection rates using KLT and LDA were 84.4% and 86.8%, respectively. A significant improvement was obtained in the filled pause detection rate using the discriminative GMM with KLT and LDA. In addition, the LDA features outperformed the KLT features in the detection of filled pauses.
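
To make the GMM-based detection step concrete, the following is a minimal sketch of how a feature vector might be scored against two GMMs, one for filled pauses and one for all other speech. The abstract does not state the decision rule, so the log-likelihood ratio test, the diagonal covariances, the function names and the toy parameters are all assumptions for illustration. The discriminative gradient descent training described above is not implemented here.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log p(x) under a GMM, using the log-sum-exp trick for stability."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def is_filled_pause(x, fp_gmm, other_gmm, threshold=0.0):
    """Hypothetical decision rule: flag x when the log-likelihood ratio
    between the filled-pause GMM and the other-speech GMM exceeds threshold."""
    llr = gmm_log_likelihood(x, *fp_gmm) - gmm_log_likelihood(x, *other_gmm)
    return llr > threshold


# Toy 2-D models as (weights, means, variances); the values are made up
# and are not taken from the paper, which selects 26 features.
fp_gmm = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
other_gmm = ([1.0], [[5.0, 5.0]], [[1.0, 1.0]])
```

In this sketch the input `x` would be a feature vector after KLT or LDA projection. The threshold trades missed filled pauses against false alarms and would need tuning on held-out data.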




Cite this article

Wu, CH., Yan, GL. Acoustic Feature Analysis and Discriminative Modeling of Filled Pauses for Spontaneous Speech Recognition. The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology 36, 91–104 (2004). https://doi.org/10.1023/B:VLSI.0000015089.17975.f4


  • DOI: https://doi.org/10.1023/B:VLSI.0000015089.17975.f4
