Abstract
Most automatic speech recognizers (ASRs) are designed for read speech, which differs from spontaneous speech containing disfluencies. ASRs cope poorly with speech that has a high rate of disfluencies such as filled pauses, repetitions, lengthening, repairs, false starts, and silent pauses. In this paper, we focus on the feature analysis and modeling of the filled pauses “ah,” “ung,” “um,” “em,” and “hem” in spontaneous speech. The Karhunen-Loève transform (KLT) and linear discriminant analysis (LDA) were adopted to select discriminant features for filled pause detection. Bartlett hypothesis testing was used to determine a suitable number of discriminant features, and twenty-six features were selected. Gaussian mixture models (GMMs), trained with a gradient descent algorithm, were used to improve filled pause detection performance. The experimental results show that the filled pause detection rates using KLT and LDA were 84.4% and 86.8%, respectively. Using the discriminative GMM with KLT and LDA yielded a significant improvement in the filled pause detection rate. In addition, the LDA features outperformed the KLT features in the detection of filled pauses.
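The contrast between KLT and LDA mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses synthetic four-dimensional data in place of real acoustic features, a two-class Fisher discriminant as the simplest instance of LDA, and hypothetical helper names (`klt_basis`, `fisher_direction`, `fisher_ratio`). The Bartlett test and the GMM training step are not covered. The sketch only shows why a variance-maximizing projection (KLT) can miss a discriminative direction that LDA finds.

```python
import numpy as np


def klt_basis(X, k):
    """Return the top-k eigenvectors of the sample covariance of X (rows = frames)."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest
    return eigvecs[:, order]


def fisher_direction(X0, X1):
    """Two-class Fisher discriminant: w proportional to Sw^{-1} (m1 - m0)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)  # within-class scatter
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)


def fisher_ratio(X0, X1, w):
    """Between-class over within-class scatter of the 1-D projections onto w."""
    p0, p1 = X0 @ w, X1 @ w
    between = (p1.mean() - p0.mean()) ** 2
    within = ((p0 - p0.mean()) ** 2).sum() + ((p1 - p1.mean()) ** 2).sum()
    return between / within


# Synthetic stand-in data (assumption): feature 0 has large variance but carries
# no class information; feature 1 separates the two classes.
rng = np.random.default_rng(0)
scale = np.array([3.0, 1.0, 1.0, 1.0])
X0 = rng.normal(size=(200, 4)) * scale                                    # "other speech"
X1 = rng.normal(size=(200, 4)) * scale + np.array([0.0, 1.5, 0.0, 0.0])  # "filled pause"

w_klt = klt_basis(np.vstack([X0, X1]), 1)[:, 0]
w_lda = fisher_direction(X0, X1)

j_klt = fisher_ratio(X0, X1, w_klt)
j_lda = fisher_ratio(X0, X1, w_lda)
print(f"Fisher ratio along first KLT axis: {j_klt:.5f}")
print(f"Fisher ratio along LDA axis:       {j_lda:.5f}")
```

Because the Fisher direction maximizes this ratio over all projection directions, the LDA value is never lower than the KLT value on the data it was fitted to. That is consistent with, though no substitute for, the reported advantage of the LDA features.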
Cite this article
Wu, CH., Yan, GL. Acoustic Feature Analysis and Discriminative Modeling of Filled Pauses for Spontaneous Speech Recognition. The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology 36, 91–104 (2004). https://doi.org/10.1023/B:VLSI.0000015089.17975.f4