Elsevier

Speech Communication

Volume 65, November–December 2014, Pages 75-93

The Hearing-Aid Speech Perception Index (HASPI)

https://doi.org/10.1016/j.specom.2014.06.002

Highlights

  • We propose a new intelligibility index based on the outputs of an auditory model.

  • The auditory model incorporates peripheral hearing loss and is accurate for both normal and impaired hearing.

  • The index compares the model outputs for a processed signal with the outputs for an unprocessed reference signal.

  • The index combines measurements of envelope and temporal fine structure fidelity.

  • Index results are presented for noise, distortion, and nonlinear signal processing outputs.

Abstract

This paper presents a new index for predicting speech intelligibility for normal-hearing and hearing-impaired listeners. The Hearing-Aid Speech Perception Index (HASPI) is based on a model of the auditory periphery that incorporates changes due to hearing loss. The index compares the envelope and temporal fine structure outputs of the auditory model for a reference signal to the outputs of the model for the signal under test. The auditory model for the reference signal is set for normal hearing, while the model for the test signal incorporates the peripheral hearing loss. The new index is compared to indices based on measuring the coherence between the reference and test signals and based on measuring the envelope correlation between the two signals. HASPI is found to give accurate intelligibility predictions for a wide range of signal degradations including speech degraded by noise and nonlinear distortion, speech processed using frequency compression, noisy speech processed through a noise-suppression algorithm, and speech where the high frequencies are replaced by the output of a noise vocoder. The coherence and envelope metrics used for comparison give poor performance for at least one of these test conditions.

Introduction

Signal degradations, such as additive noise or nonlinear distortion, can reduce speech intelligibility for both normal-hearing and hearing-impaired listeners, even when hearing aids are used. Hearing aids, in particular, can present a wide range of signal modifications, since the input signal may be noisy and the hearing aid may incorporate several nonlinear processing algorithms (Kates, 2008). Hearing-aid processing includes dynamic-range compression, in which low-level portions of the signal receive greater amplification than high-level portions; the time-varying gain distorts the signal envelope and introduces modulation sidebands. Noise-suppression algorithms attenuate the noisier portions of the noisy speech signal and, like dynamic-range compression, modify the signal envelope and introduce modulation sidebands. Frequency compression (Souza et al., 2013), in which high-frequency portions of the spectrum are shifted to lower frequencies where a hearing-impaired listener may have better hearing thresholds, is also implemented in several hearing aids. The frequency shifting causes inherent distortions, including reduced spacing between harmonics, altered spectral peak levels, and a modified spectral shape (McDermott, 2011).
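
The envelope distortion caused by dynamic-range compression follows directly from its level-dependent gain. A minimal sketch of a static compression rule (the threshold and ratio here are illustrative assumptions, not the settings of any particular hearing aid):

```python
import numpy as np

def wdrc_gain_db(level_db, threshold_db=50.0, ratio=3.0):
    """Static compression rule (illustrative parameters): unity gain below
    threshold; above it, each 1 dB of input yields only 1/ratio dB of output."""
    over = np.maximum(np.asarray(level_db, dtype=float) - threshold_db, 0.0)
    return -over * (1.0 - 1.0 / ratio)

# A 30 dB input swing above threshold (50 -> 80 dB) is squeezed to a 10 dB
# output swing, so envelope peaks receive less gain than envelope valleys.
```

Because this gain is applied as a time-varying multiplication of the waveform, it both flattens the envelope and creates modulation sidebands around the signal components.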

Many of these degradation mechanisms simultaneously affect the signal envelope and the signal temporal fine structure (TFS). Additive noise, for example, reduces the envelope modulation depth by filling in the pauses in the speech and also corrupts the TFS of the speech by adding timing jitter corresponding to the random fluctuations of the noise. Peak clipping, which may be used to prevent unacceptably loud sounds, reduces the signal modulation depth by removing the signal peaks and also modifies the TFS by introducing additional frequency components corresponding to the harmonic distortion products. Thus for many forms of signal degradation, changes to the signal envelope and to the TFS are closely related.
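
The peak-clipping example can be verified numerically: symmetrically clipping a pure tone removes its peaks and creates odd-harmonic distortion products (the 1 kHz tone, sample rate, and clip level below are arbitrary choices for the demonstration):

```python
import numpy as np

fs = 16000
t = np.arange(fs) / fs                      # 1 s of signal, 1 Hz bin spacing
x = np.sin(2 * np.pi * 1000 * t)            # 1 kHz tone
y = np.clip(x, -0.5, 0.5)                   # symmetric peak clipping

spec = np.abs(np.fft.rfft(y)) / len(y)      # bin k corresponds to k Hz
h1 = spec[1000]                             # fundamental
h3 = spec[3000]                             # third harmonic added by clipping
```

Symmetric clipping preserves half-wave symmetry, so only odd harmonics appear; the new components at 3 kHz, 5 kHz, etc. are exactly the TFS changes described above.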

Changes to the signal TFS have been successfully used to predict speech intelligibility. The TFS changes are often measured using the coherence function (Carter et al., 1973, Shaw, 1981, Kates, 1992). In the time domain, the coherence is computed by taking the cross-correlation between a noise-free unprocessed reference signal and the noisy processed signal and dividing by the product of the root-mean-squared (RMS) intensities of the two signals. The magnitude-squared coherence is converted to a signal-to-distortion ratio (SDR) which can be used in a manner similar to the signal-to-noise ratio (SNR) in computing the Speech Intelligibility Index (SII) (ANSI, 1997) to produce the coherence SII (CSII) (Kates and Arehart, 2005).
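
The coherence-to-SDR chain can be sketched numerically as follows (Welch-style segment averaging with a Hann window; the segment count and regularization constant are illustrative choices, not the CSII's published parameters):

```python
import numpy as np

def magnitude_squared_coherence(x, y, nseg=16):
    """Segment-averaged magnitude-squared coherence between x and y."""
    seg = len(x) // nseg
    win = np.hanning(seg)
    Sxy = np.zeros(seg // 2 + 1, dtype=complex)
    Sxx = np.zeros(seg // 2 + 1)
    Syy = np.zeros(seg // 2 + 1)
    for k in range(nseg):
        X = np.fft.rfft(win * x[k * seg:(k + 1) * seg])
        Y = np.fft.rfft(win * y[k * seg:(k + 1) * seg])
        Sxy += X * np.conj(Y)
        Sxx += np.abs(X) ** 2
        Syy += np.abs(Y) ** 2
    return np.abs(Sxy) ** 2 / (Sxx * Syy + 1e-30)

def sdr_db(msc):
    """Convert magnitude-squared coherence to a signal-to-distortion ratio in dB."""
    msc = np.clip(msc, 1e-6, 1.0 - 1e-6)
    return 10.0 * np.log10(msc / (1.0 - msc))
```

An identical reference and test signal give coherence near 1 (infinite SDR); adding uncorrelated noise pulls the coherence, and hence the SDR, down in every band.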

Changes to the signal envelope have also been used to predict speech intelligibility. The original version of the Speech Transmission Index (STI) (Houtgast and Steeneken, 1971, Steeneken and Houtgast, 1980), for example, used bands of amplitude-modulated noise as the probe signals and measured the reduction in signal modulation depth. However, this original version of the STI is not accurate for hearing-aid processing such as dynamic-range compression (Hohmann and Kollmeier, 1995). Speech-based versions of the STI have been developed that estimate the SNR from cross-correlations of the signal envelopes in each frequency band (Ludvigsen et al., 1990, Holube and Kollmeier, 1996, Goldsworthy and Greenberg, 2004, Payton and Shrestha, 2008). An intelligibility index based on averaging envelope correlations for 20-ms speech segments has been developed by Christiansen et al. (2010), and Taal et al. (2011b) have developed the short-time objective intelligibility measure (STOI), which uses envelope correlations computed for 384-ms speech segments. Changes in the envelope time–frequency modulation have also been used as the basis of a speech intelligibility index (Elhilali et al., 2003).
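
The envelope-correlation family of measures can be sketched for a single band as follows (the segment length and normalization are illustrative; this is not the exact STOI computation, which includes clipping and other details):

```python
import numpy as np

def short_time_env_corr(env_ref, env_deg, seg=30):
    """Average the normalized correlation of mean-removed envelope segments
    between a reference band envelope and a degraded band envelope."""
    scores = []
    for i in range(0, len(env_ref) - seg + 1, seg):
        a = env_ref[i:i + seg] - env_ref[i:i + seg].mean()
        b = env_deg[i:i + seg] - env_deg[i:i + seg].mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 0.0:
            scores.append(float(np.dot(a, b) / denom))
    return float(np.mean(scores))
```

Additive noise partially fills the envelope valleys, so the segment correlations, and thus the averaged score, fall below 1.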

If intelligibility can be predicted using either signal coherence or envelope correlation, is there any reason to prefer one approach over the other? A procedure that combines coherence with changes in the signal envelope may be more robust than one that uses just the coherence because there are several situations where a coherence-based approach can fail. One example where coherence will perform poorly is frequency compression. Frequency compression (Aguilera Muñoz et al., 1999, Simpson et al., 2005, Glista et al., 2009) is intended to improve the audibility of high-frequency speech sounds by shifting them to lower frequency regions where listeners with high-frequency hearing loss have better hearing thresholds. However, the cross-correlation between a sinusoid and a frequency-shifted version of the sinusoid will approach zero as the duration of the observation interval is increased. Thus frequency compression will lead to predictions of lower intelligibility as the amount of frequency shift is increased even if the intelligibility has not actually been affected, and the predicted loss in intelligibility will depend on the size of the speech segments used in computing the intelligibility index.
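
The sinusoid argument is easy to demonstrate numerically (the sample rate, tone frequencies, and 20 Hz shift below are arbitrary illustrative values):

```python
import numpy as np

FS = 16000  # sample rate in Hz, chosen only for this demonstration

def tone_correlation(f1, f2, dur):
    """Zero-lag normalized cross-correlation of two tones over dur seconds."""
    t = np.arange(int(FS * dur)) / FS
    x = np.cos(2 * np.pi * f1 * t)
    y = np.cos(2 * np.pi * f2 * t)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# A 20 Hz shift barely matters over a 10-ms window but destroys the
# correlation over a 1-s window, since the phase error accumulates with time.
short_corr = abs(tone_correlation(1000.0, 1020.0, 0.010))
long_corr = abs(tone_correlation(1000.0, 1020.0, 1.0))
```

This is the segment-length dependence described above: the longer the observation interval relative to the reciprocal of the frequency shift, the closer the cross-correlation falls to zero.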

A second situation where coherence has limitations is for some forms of noise suppression, specifically the ideal binary mask (IBM). In IBM processing, the speech is divided into frequency bands and each band further divided into time segments to produce time–frequency cells. If the SNR in a time–frequency cell is greater than a preset threshold (e.g. 0 dB) the gain for that cell is set to 1, otherwise the cell is attenuated (Wang et al., 2008, Kjems et al., 2009). High intelligibility is found for noisy speech when the ideal mask, computed from the speech and noise with the threshold set to the signal-to-noise ratio, is applied to a signal comprised of noise alone (Wang et al., 2008). The IBM output in this case is amplitude-modulated noise. The cross-correlation between the reference speech and modulated noise is therefore low and a coherence-based procedure would predict low intelligibility. Poor correlation of the CSII with IBM-processed speech has been reported by Christiansen et al. (2010) and by Taal et al., 2011a, Taal et al., 2011b.
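
The IBM construction can be sketched compactly (the STFT parameters are hypothetical choices, and the mask here gates cells to 0 rather than merely attenuating them):

```python
import numpy as np

def stft_mag2(x, nfft=256, hop=128):
    """Power in each time-frequency cell via a Hann-windowed STFT."""
    win = np.hanning(nfft)
    frames = [x[i:i + nfft] * win for i in range(0, len(x) - nfft + 1, hop)]
    return np.abs(np.array([np.fft.rfft(f) for f in frames])) ** 2

def ideal_binary_mask(speech, noise, thr_db=0.0):
    """1 where the local speech-to-noise power ratio exceeds thr_db, else 0.
    Requires separate access to the speech and noise, hence 'ideal'."""
    s, n = stft_mag2(speech), stft_mag2(noise)
    snr_db = 10.0 * np.log10((s + 1e-30) / (n + 1e-30))
    return (snr_db > thr_db).astype(float)
```

Applying this mask to the STFT of the noise alone yields amplitude-modulated noise that carries the speech's time-frequency energy pattern while sharing none of its fine structure, which is exactly why a coherence-based metric underestimates its intelligibility.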

A third example is the noise vocoder (Dudley, 1939, Shannon et al., 1995), in which the speech is replaced by bands of noise having the same envelope modulation as the speech. Excellent intelligibility can be obtained even though the speech TFS has been replaced by the random fluctuations of the noise (Shannon et al., 1995, Stone et al., 2008, Souza and Rosen, 2009, Anderson, 2010). However, a coherence-based calculation will predict lower intelligibility because of the reduction in the cross-correlation between the original speech and the noise vocoder output. Poor correlation of the CSII with noise-vocoded speech has been reported by Cosentino et al. (2012), although Chen and Loizou (2011) found comparable performance between the CSII and an envelope-based version of the STI.
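
One band of such a vocoder can be sketched as follows (the FFT-based Hilbert envelope and the white-noise carrier are illustrative choices; a full vocoder would first split the speech into bands, process each, and sum the results):

```python
import numpy as np

def hilbert_envelope(x):
    """Envelope as the magnitude of the analytic signal (even-length x assumed)."""
    n = len(x)
    X = np.fft.fft(x)
    X[1:n // 2] *= 2.0        # double the positive frequencies
    X[n // 2 + 1:] = 0.0      # zero the negative frequencies
    return np.abs(np.fft.ifft(X))

def vocode_band(band, rng):
    """Replace the band's temporal fine structure with a noise carrier
    modulated by the band's own envelope."""
    return hilbert_envelope(band) * rng.standard_normal(len(band))
```

The output preserves the band's envelope modulation while its fine structure is random, so its cross-correlation with the original speech is low even when the envelope cues supporting intelligibility survive.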

These weaknesses in the use of coherence to predict intelligibility suggest that a procedure that combines coherence with changes in the envelope modulation may be more accurate than one that is based on coherence alone. For example, the results of Gómez et al. (2012) show that combining the CSII with an envelope measurement improves the accuracy in comparison to the CSII alone when predicting speech intelligibility for normal-hearing listeners for speech corrupted by various forms of additive noise.

An additional concern is predicting speech intelligibility for hearing-impaired listeners. An accurate intelligibility index for hearing-aid users has to deal with noisy input signals, the distortion introduced by the hearing-aid processing, and the hearing loss. Hearing loss is most often modeled as a shift in auditory threshold, and this threshold shift has been represented as an increase in the internal auditory noise level in the SII calculation procedure (Pavlovic et al., 1986, Humes et al., 1986, Payton et al., 1994, Holube and Kollmeier, 1996, Ching et al., 1998, Kates and Arehart, 2005). A similar modification of the hearing threshold has been applied to the STI (Humes et al., 1986, Payton et al., 1994, Holube and Kollmeier, 1996). Limitations in the accuracy of the predictions have led to empirical modifications of the SII, including a “desensitization factor” that increases with increasing hearing loss (Pavlovic et al., 1986) and a frequency-dependent proficiency factor that also depends on the hearing loss (Ching et al., 1998).

A more thorough model of peripheral hearing loss would be expected to yield more accurate intelligibility predictions. An auditory model (Dau et al., 1996) was used by Holube and Kollmeier (1996) for intelligibility predictions, and hearing loss was first implemented as a threshold shift based on the audiogram. Individual adjustments of the filter bandwidths and forward masking time constants were then incorporated into the model, which resulted in a small improvement in the accuracy of the intelligibility predictions for speech in noise. Hines and Harte (2010) also used a cochlear model (Zilany and Bruce, 2006) as an auditory front end for their intelligibility calculations. However, they only present simulation results, so the benefit of their approach in predicting intelligibility for hearing-impaired listeners has not been verified.

The purpose of this paper is to present a new intelligibility index that (1) combines measurements of coherence with measurements of envelope fidelity to give improved accuracy for a wide range of processing conditions, and (2) is accurate for hearing-impaired as well as normal-hearing listeners. The new index, the Hearing Aid Speech Perception Index (HASPI), uses an auditory model that incorporates aspects of normal and impaired peripheral auditory function (Kates, 2013). The auditory coherence is computed from the modeled basilar-membrane vibration output in each frequency band, and provides a measurement sensitive to the changes in the speech temporal fine structure. The cepstral correlation is computed from the envelope output in each frequency band, and provides a measurement of the fidelity with which the envelope time–frequency modulation has been preserved.
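
Stripped of the auditory model, the cepstral-correlation idea can be sketched as: project each time frame of log band envelopes onto cosine basis functions across frequency, then correlate each resulting coefficient sequence over time between reference and test. The parameters below (six basis functions, a DCT-II-style basis, skipping the DC term) are illustrative assumptions, not the published HASPI settings:

```python
import numpy as np

def cepstral_correlation(env_ref_db, env_deg_db, ncep=6):
    """env_*_db: arrays of shape [time, band] holding log band envelopes.
    Returns the mean, over basis functions, of the time correlation of the
    corresponding cepstral coefficient sequences."""
    nbands = env_ref_db.shape[1]
    k = np.arange(nbands)
    basis = np.array([np.cos(np.pi * j * (k + 0.5) / nbands)
                      for j in range(1, ncep + 1)])   # skip the DC term
    c_ref = env_ref_db @ basis.T                       # [time, ncep]
    c_deg = env_deg_db @ basis.T
    corrs = []
    for j in range(ncep):
        a = c_ref[:, j] - c_ref[:, j].mean()
        b = c_deg[:, j] - c_deg[:, j].mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        corrs.append(float(np.dot(a, b) / denom) if denom > 0 else 0.0)
    return float(np.mean(corrs))
```

Because each basis function captures a different spectral ripple, the averaged correlation measures how well the envelope's time–frequency modulation pattern, rather than any single band's level, has been preserved.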

The remainder of the paper starts with a description of the data used to train and evaluate the intelligibility indices. The datasets include noise and nonlinear distortion, frequency compression for speech in babble noise, noisy speech processed using an ideal-binary-mask noise-suppression algorithm, and speech partially replaced by the output of a noise vocoder. The auditory model used for the new index is then described, followed by a description of how the outputs of the auditory model are combined to produce the new HASPI index. The CSII and an envelope-based index derived from the STOI are used as comparisons in the paper; a modified version of the STOI was needed because the published STOI does not take auditory threshold or hearing loss into account. The revised CSII and modified STOI calculations are then described. Results are presented for the four datasets, followed by a discussion of the factors that influence the model accuracy.

Section snippets

Intelligibility data

The original CSII was fitted to speech corrupted by noise and distortion (Kates and Arehart, 2005), and those data are described below. The revised CSII and HASPI are fitted to four datasets, which comprise the noise and distortion data used for the original CSII plus results from three additional experiments. These additional datasets comprise frequency compression, noise suppression, and noise vocoder data. For all experiments, subjects listened to speech presented monaurally over headphones in …

Auditory model

The approach to predicting speech intelligibility used in HASPI is to compare the output of an auditory model for a degraded test signal with the output for an unprocessed input signal. A detailed description of the auditory model is presented in Kates (2013) and is summarized here. The model is an extension of the Kates and Arehart (2010) auditory model; that model has been shown to give outputs that can be used to produce accurate predictions of speech quality for a wide variety of hearing …

Intelligibility indices

This paper compares HASPI to the CSII and to an envelope-based index based on the STOI. The CSII, which is based on coherence, is described first. This is followed by HASPI, which combines coherence and envelope. The final index described is an envelope-based index motivated by the STOI, but which is adapted for hearing-impaired as well as normal-hearing listeners.

Results

Scatter plots for the index predictions are presented in Fig. 8. The open circles represent each processing condition averaged over the NH listeners, while the filled squares give each processing condition averaged over the HI listeners. The diagonal line represents perfect predictions; a point above the line indicates that the model prediction is less than the observed intelligibility, while a point below the line indicates that the model prediction is higher than the observed intelligibility.

Discussion

It was proposed that an index based on coherence alone would not perform as well as one incorporating both coherence and envelope modulation when applied to frequency compression, noise suppression, and noise vocoder data. The results presented in this paper mainly support that hypothesis. Both HASPI and CSII work well for the noise and distortion dataset, with HASPI having slightly better accuracy than CSII for the predictions when averaged over all of the subjects. Both indices also work well …

Summary and conclusions

This paper has presented a new index for predicting speech intelligibility. HASPI compares the envelope and TFS outputs of an auditory model for a reference signal to the outputs of the model for a degraded signal. The model for the reference signal is adjusted for normal hearing, while the model for the degraded signal incorporates the peripheral hearing loss. The auditory model includes the middle-ear transfer function, an auditory filterbank, outer hair-cell dynamic-range compression, …

Acknowledgments

The authors thank Dr. Rosalinda Baca for providing the statistical analysis used in this paper. Author JMK was supported by a grant from GN ReSound. Author KHA was supported by an NIH grant (R01 DC60014) and by the grant from GN ReSound.

References (82)

  • I.C. Bruce et al. (2003). An auditory-periphery model of the effects of acoustic trauma on auditory nerve responses. J. Acoust. Soc. Am.
  • D. Byrne et al. (1986). The National Acoustics Laboratories' (NAL) new procedure for selecting gain and frequency response of a hearing aid. Ear Hear.
  • G.C. Carter et al. (1973). Estimation of the magnitude-squared coherence function via overlapped fast Fourier transform processing. IEEE Trans. Audio Electroacoust.
  • F. Chen et al. (2011). Predicting the intelligibility of vocoded speech. Ear Hear.
  • F. Chen, T. Guan, L.N. Wong (2013). Effect of temporal fine structure on speech intelligibility modeling. In: Proc. ...
  • T.Y.C. Ching et al. (1998). Speech recognition of hearing-impaired listeners: predictions from audibility and the limited role of high-frequency amplification. J. Acoust. Soc. Am.
  • M. Cooke (1991). Modeling Auditory Processing and Organization. PhD Thesis, U. Sheffield.
  • N.P. Cooper et al. (1997). Mechanical responses to two-tone distortion products in the apical and basal turns of the mammalian cochlea. J. Neurophysiol.
  • S. Cosentino, T. Marquardt, D. McAlpine, T.H. Falk (2012). Towards objective measures of speech intelligibility for ...
  • T. Dau et al. (1996). A quantitative model of the “effective” signal processing in the auditory system: I. Model structure. J. Acoust. Soc. Am.
  • H. Dudley (1939). Remaking speech. J. Acoust. Soc. Am.
  • D. Fogerty (2011). Perceptual weighting of individual and concurrent cues for sentence intelligibility: frequency, envelope, and fine structure. J. Acoust. Soc. Am.
  • D. Glista et al. (2009). Evaluation of nonlinear frequency compression: clinical outcomes. Int. J. Audiol.
  • R.L. Goldsworthy et al. (2004). Analysis of speech-based speech transmission index methods with implications for nonlinear operations. J. Acoust. Soc. Am.
  • M.P. Gorga et al. (1981). AP measurements of short-term adaptation in normal and acoustically traumatized ears. J. Acoust. Soc. Am.
  • S. Greenberg, T. Arai (2004). What are the essential cues for understanding spoken language? IEICE Trans. Inf. and ...
  • D.M. Harris et al. (1979). Forward masking of auditory nerve fiber responses. J. Neurophys.
  • M.L. Hicks et al. (1999). Psychophysical measures of auditory nonlinearities as a function of frequency in individuals with normal hearing. J. Acoust. Soc. Am.
  • V. Hohmann et al. (1995). The effect of multichannel dynamic compression on speech intelligibility. J. Acoust. Soc. Am.
  • I. Holube et al. (1996). Speech intelligibility predictions in hearing-impaired listeners based on a psychoacoustically motivated perception model. J. Acoust. Soc. Am.
  • K. Hopkins et al. (2011). The effects of age and cochlear hearing loss on temporal fine structure sensitivity, frequency sensitivity, and speech reception in noise. J. Acoust. Soc. Am.
  • K. Hopkins et al. (2008). Effects of moderate cochlear hearing loss on the ability to benefit from temporal fine structure information in speech. J. Acoust. Soc. Am.
  • T. Houtgast et al. (1971). Evaluation of speech transmission channels by using artificial signals. Acustica.
  • L.E. Humes et al. (1986). Application of the Articulation Index and the Speech Transmission Index to the recognition of speech by normal-hearing and hearing-impaired listeners. J. Speech Hear. Res.
  • S. Imai (1983). Cepstral analysis synthesis on the mel frequency scale. In: Proc. IEEE Int. Conf. Acoust. Speech and ...
  • L.V. Immerseel et al. (2003). Digital implementation of linear gammatone filters: comparison of design methods. Acoust. Res. Lett. Online.
  • J.M. Kates (1991). A time domain digital cochlear model. IEEE Trans. Sig. Proc.
  • J.M. Kates (1992). On using coherence to measure distortion in hearing aids. J. Acoust. Soc. Am.
  • J.M. Kates (2008). Digital Hearing Aids.
  • J.M. Kates (2013). An auditory model for intelligibility and quality predictions. Proc. Mtgs. Acoust. (POMA) 19, ...
  • J.M. Kates et al. (2005). Coherence and the speech intelligibility index. J. Acoust. Soc. Am.