Phoneme-group specific octave-band weights in predicting speech intelligibility

doi:10.1016/S0167-6393(02)00011-0

Speech Communication

Volume 38, Issues 3–4, November 2002, Pages 399-411

https://doi.org/10.1016/S0167-6393(02)00011-0

Abstract

In an earlier study we derived robust frequency-weighting functions for predicting the intelligibility of short nonsense words. Such frequency-weighting functions are used in objective intelligibility predictors such as the speech transmission index (STI). Six independent experiments revealed essentially similar frequency-weighting functions for the prediction of the nonsense-word scores with respect to signal-to-noise ratio and gender [Speech Communication 28 (1999) 109]. Although the frequency weightings do not vary significantly with signal-to-noise ratio or gender, other studies have shown that using different types of speech material (i.e., nonsense words, phonetically balanced words and connected discourse) results in quite different frequency-weighting functions. This may be related to the distribution of specific phonemes in the test material. In order to obtain a more generic description of the frequency weighting, four relevant groups of phonemes were identified. In situations with reduced intelligibility, a small confusion rate of the phonemes between the groups and a high confusion rate of the phonemes within each group were observed. For each group a specific frequency-weighting function and a good prediction of the phoneme-group scores could be obtained. It was shown that from these (weighted) phoneme-group scores, word scores could be predicted with a prediction accuracy of ca. 4% (this corresponds to a signal-to-noise ratio of about 1 dB). Hence, this method provides a more generic way to predict intelligibility scores for different types of speech material.

Introduction

Octave-band weighting functions represent the contribution of each octave band to the intelligibility of a speech signal. In an earlier study it was found that these weighting functions are robust with respect to signal-to-noise ratio and gender of the speaker (Steeneken, 1992; Steeneken and Houtgast, 1999). However, experiments based on different types of speech material showed quite different frequency-weighting functions (French and Steinberg, 1947; Steeneken and Houtgast, 1980; Steeneken and Houtgast, 1999; Pavlovic, 1987; Studebaker et al., 1987; Duggirala et al., 1988). This may be related to the specific distribution of phonemes in the test material, since the frequency-weighting functions do vary significantly according to phonetic content. In Fig. 1 typical weighting functions are given that are derived from two standards on the objective prediction of speech intelligibility (speech transmission index, STI, described by IEC 60268-16, 1998; speech intelligibility index, SII, by ANSI S3.5, 1997) and for consonants and vowels from a study by Steeneken (1992). Fig. 1 shows a large difference between the curves for consonants and for vowels. The weighting function for the vowels has a maximum for the contributions in the 0.5 and 2 kHz octave bands. Consonants and equally balanced CVC words (words of the type consonant–vowel–consonant with an equally balanced phoneme distribution) cover a wider frequency range (125 Hz–8 kHz). Evidently, the octave-band contributions depend on the type of speech considered.
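As a rough illustration of how an octave-band weighting function enters an STI-style prediction, the sketch below maps each band's signal-to-noise ratio to a transmission index and forms a weighted sum over the seven octave bands from 125 Hz to 8 kHz. The clipping range of -15 to +15 dB follows the usual STI convention. The band weights are hypothetical placeholders, not values from either standard or from this study, the function names are illustrative only, and the sketch ignores any correction for mutual dependency between adjacent bands.

```python
from typing import Sequence

# Octave-band centre frequencies in Hz (125 Hz to 8 kHz, as in the text).
BANDS_HZ = [125, 250, 500, 1000, 2000, 4000, 8000]

# HYPOTHETICAL weights for illustration only; they sum to 1.
EXAMPLE_WEIGHTS = [0.10, 0.15, 0.15, 0.20, 0.20, 0.15, 0.05]


def transmission_index(snr_db: float) -> float:
    """Map a band SNR in dB to [0, 1], clipping at -15 dB and +15 dB."""
    clipped = max(-15.0, min(15.0, snr_db))
    return (clipped + 15.0) / 30.0


def weighted_index(snr_db: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the per-band transmission indices."""
    if len(snr_db) != len(weights):
        raise ValueError("need one weight per octave band")
    return sum(w * transmission_index(s) for w, s in zip(weights, snr_db))
```

Swapping in a different weight vector (for example one per phoneme group) changes which bands dominate the index while the rest of the calculation stays the same.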

This is in agreement with the differences found for the effect of various types of distortions on the intelligibility of vowels and consonants. This is illustrated in Fig. 2, showing a scatter diagram of the initial consonant score versus vowel scores for male speech in 78 transmission conditions with various combinations of bandwidth and signal-to-noise ratio.

For diagnostic assessment of speech communication systems it is of interest to consider not only the overall performance derived from a specific intelligibility test (i.e., related to the speech material) but also to identify the performance for specific phonemes or groups of phonemes (Miller and Nicely, 1955). For example, the standard for the SII recommends six groups of frequency-weighting factors for prediction of different subjective intelligibility measures.

From an experiment with CVC-word tests we obtained confusions among consonants and among vowels for many different transmission conditions. For the consonants, a clustering of three groups of consonants with many intra-group confusions was found (fricatives, plosives, and vowel-like consonants). The phonemes within each of these groups show a quite similar response for various types of degradation, and confusions are mainly between phonemes within each group (Steeneken, 1992). This is given in Table 1 for 17 representative Dutch initial consonants obtained for male speech and 26 different combinations of band-pass limiting.

In the table the phonemes (SAMPA notation, 1987) that show many mutual confusions are grouped together. This results in a clustering of the plosives (p, t, k, b, d), the fricatives (f, s, v, z, x), and the vowel-like consonants (m, n, l, R, w, j, h). Some confusions are found between phonemes belonging to different clusters: f, s → p, t; v, z → w, j, h, and b → w.
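The grouping criterion described above (many confusions within a cluster, few between clusters) can be checked with a short calculation. The sketch below uses the three consonant clusters listed in the text and splits all errors in a confusion matrix into within-group and between-group shares. The confusion counts are made up for illustration and are not taken from Table 1; the function and variable names are hypothetical.

```python
from typing import Dict, List

# Consonant clusters as listed in the text (SAMPA symbols).
GROUPS: Dict[str, List[str]] = {
    "plosives": ["p", "t", "k", "b", "d"],
    "fricatives": ["f", "s", "v", "z", "x"],
    "vowel-like": ["m", "n", "l", "R", "w", "j", "h"],
}

PHONEME_TO_GROUP = {p: g for g, members in GROUPS.items() for p in members}


def confusion_shares(confusions: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """Split all errors into within-group and between-group shares.

    confusions[stimulus][response] is a count; correct responses
    (stimulus == response) are ignored.
    """
    within = between = 0
    for stim, row in confusions.items():
        for resp, count in row.items():
            if stim == resp:
                continue
            if PHONEME_TO_GROUP[stim] == PHONEME_TO_GROUP[resp]:
                within += count
            else:
                between += count
    total = within + between
    if total == 0:
        return {"within": 0.0, "between": 0.0}
    return {"within": within / total, "between": between / total}


# MADE-UP counts, for illustration only (not data from the study).
toy = {
    "p": {"p": 80, "t": 12, "k": 6, "f": 2},
    "f": {"f": 70, "s": 20, "p": 5, "t": 5},
    "m": {"m": 85, "n": 14, "b": 1},
}
shares = confusion_shares(toy)
```

A within-group share close to 1 would support treating the cluster as a single unit for the purpose of frequency weighting.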

A similar representation for 15 Dutch vowels derived from the same set of transmission conditions did not show a systematic clustering. Hence, for the determination of (Dutch) phoneme-specific octave-band weights, in total four clusters of phonemes are likely to be considered: fricatives, plosives, vowel-like consonants and vowels. These clusters of frequently used phonemes (>2% for the Dutch language) consist of 17 initial consonants and 15 vowels. For reasons of simplicity the final consonants were not considered separately, as these 11 consonants are mainly a subset of the initial consonants.
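To make the step from phoneme-group scores to a word score concrete, the sketch below pools the three consonant-cluster scores into one consonant score and combines it with a vowel score for a CVC word. Both steps rest on simplifying assumptions that are not taken from this study: every consonant is assumed to occur equally often, the three phonemes of a word are assumed to be recognised independently, and the final consonant is assumed to behave like the initial one. The weighting actually used by the authors is not given in this excerpt, and all names and scores below are hypothetical.

```python
# Number of initial consonants per cluster, from the grouping in the text.
CLUSTER_SIZES = {"plosives": 5, "fricatives": 5, "vowel-like": 7}


def pooled_consonant_score(cluster_scores: dict) -> float:
    """Count-weighted mean of the three consonant-cluster scores (0..1).

    ASSUMPTION: every consonant occurs equally often in the test material.
    """
    total = sum(CLUSTER_SIZES.values())
    return sum(CLUSTER_SIZES[g] * cluster_scores[g] for g in CLUSTER_SIZES) / total


def cvc_word_score(consonant: float, vowel: float) -> float:
    """Word score for consonant-vowel-consonant nonsense words.

    ASSUMPTION: the three phonemes are recognised independently and the
    final consonant behaves like the initial one; not the paper's model.
    """
    return consonant * vowel * consonant


# MADE-UP group scores, for illustration only.
scores = {"plosives": 0.8, "fricatives": 0.6, "vowel-like": 0.9}
c = pooled_consonant_score(scores)  # (5*0.8 + 5*0.6 + 7*0.9) / 17
w = cvc_word_score(c, vowel=0.95)
```

Because the word score is a product of phoneme scores in this sketch, a modest drop in any one group score lowers the word score more than proportionally, which is one reason group-level prediction is useful for diagnosis.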

Section snippets

Experimental design

For the determination of the octave-band-specific frequency weighting of each phoneme group, both the phoneme scores and the related octave-band-specific signal-to-noise ratios are required for a large number of different conditions. These data can be obtained by making use of a universal communication channel of which the transfer conditions (i.e., bandwidth, additive noise type, and signal-to-noise ratio) can be adjusted. For each condition the phoneme-group-specific score and the mean

Experimental results

The experiments were based on the determination of the subjective and objective transmission quality of 78 transmission conditions for male speech and 51 transmission conditions for female speech. These transmission conditions were combinations of band-pass limiting (respectively 26 for male and 17 for female) and noise (3 signal-to-noise ratios). The subjective data included the individual phoneme-group scores and the CVC scores for male and female speakers. The objective data included the

Frequency weighting with respect to the type of speech

Several different frequency-weighting factors to predict speech intelligibility have been found in various studies. As given in Fig. 1 these are all related to different types of speech. The goal of this study is to develop a more generic model for general application. There are many differences between the studies that derived the frequency weighting functions, in particular with respect to the method by which the frequency-weighting function is derived from the subjective scores and the

Conclusions

The frequency-weighting functions used in the objective prediction of speech intelligibility depend on the type of speech material used for the development of such a method (Fig. 1). This study aimed to develop a more generic model for the objective prediction of speech intelligibility, independent of the type of speech. For this purpose four phoneme groups were used (fricatives, plosives, vowel-like consonants, and vowels) for which four different sets of frequency-weighting functions

References (16)

H.J.M. Steeneken et al. Mutual dependency of the octave-band weights in predicting speech intelligibility. Speech Communication (1999)
ANSI S3.5, 1997. American National Standard, Methods for the calculation of the speech intelligibility index. Standards...
A.W. Bronkhorst et al. A model for context effects in speech recognition. J. Acoust. Soc. Amer. (1992)
V. Duggirala et al. Frequency importance functions for a feature recognition test material. J. Acoust. Soc. Amer. (1988)
N.R. French et al. Factors governing the intelligibility of speech sounds. J. Acoust. Soc. Amer. (1947)
IEC International Standard, 1998. Sound system equipment – Part 16. Objective rating of speech intelligibility by...
T. Houtgast et al. A review of the MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria. J. Acoust. Soc. Amer. (1985)
G.A. Miller et al. An analysis of perceptual confusions among some English consonants. J. Acoust. Soc. Amer. (1955)

There are more references available in the full text version of this article.

