On the usefulness of STFT phase spectrum in human listening tests☆
Introduction
In this paper, the usefulness of the phase spectrum is explored in human speech perception. The authors have a long-term goal of utilising phase spectra in an effort to improve automatic speech recognition (ASR) performance. It is common practice in ASR to discard the phase spectrum in favour of features that are derived purely from the magnitude spectrum (Picone, 1993). In the ASR framework, speech is processed frame-wise using a temporal window of duration 20–40 ms. If the phase spectrum is to be of any use for ASR applications, it should provide some information about speech intelligibility using small window durations (20–40 ms) in a human perception experiment.
A few studies have been reported in the literature which discuss whether the phase spectrum provides any information which can contribute to intelligibility for human speech recognition (HSR). Schroeder (1975), and Oppenheim and Lim (1981) performed some informal perception experiments, concluding that the phase spectrum is important for intelligibility when the window duration of the short-time Fourier transform (STFT) is large (Tw > 1 s), while it seems to convey negligible intelligibility at small window durations (20–40 ms).
Liu et al. (1997) have recently investigated the intelligibility of phase spectra through a more formal human speech perception study. They recorded six stop-consonants from 10 speakers in vowel–consonant–vowel context. Using these recordings, they created magnitude-only and phase-only stimuli. Magnitude-only stimuli were created by analysing the original recordings with an STFT, replacing each frame’s phase spectrum with random phase values, then reconstructing the speech signal using the overlap-add method. In the case of phase-only stimuli, the original phase of each frame was retained, while the magnitude of each frame was set to unity for all frequency components. The stimuli were created for various window lengths from 16 ms to 512 ms. These were played to subjects, whose task was to identify each as one of the six consonants. Their results (Fig. 1) show that the intelligibility of magnitude-only stimuli decreases while the intelligibility of phase-only stimuli increases as the window duration increases. For small window durations (Tw < 128 ms), magnitude-only stimuli are significantly more intelligible than phase-only stimuli (while the opposite is true for larger window lengths). This implies that for small window durations (which are of relevance for ASR applications), the magnitude spectrum contributes much more towards intelligibility than the phase spectrum.
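The stimulus construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the procedure used by Liu et al. or by the authors: the function names, the 16 kHz sampling rate, the noise signal standing in for a recording, and the absence of any output normalisation are all assumptions made for the example.

```python
import numpy as np

def stft(x, win, hop):
    """Window overlapping frames of x and return one FFT spectrum per frame."""
    n = len(win)
    starts = range(0, len(x) - n + 1, hop)
    return np.array([np.fft.rfft(win * x[s:s + n]) for s in starts])

def overlap_add(spectra, win_len, hop):
    """Inverse-transform each frame and sum the frames at their original offsets."""
    y = np.zeros(hop * (len(spectra) - 1) + win_len)
    for i, spec in enumerate(spectra):
        y[i * hop:i * hop + win_len] += np.fft.irfft(spec, n=win_len)
    return y

def magnitude_only(spectra, rng):
    """Keep each frame's magnitude; replace its phase with uniform random values."""
    phase = rng.uniform(-np.pi, np.pi, size=spectra.shape)
    return np.abs(spectra) * np.exp(1j * phase)

def phase_only(spectra):
    """Keep each frame's phase; set the magnitude to unity at every frequency."""
    return np.exp(1j * np.angle(spectra))

# Illustrative settings (assumed): 16 kHz sampling, Tw = 32 ms, frame shift Tw/2.
fs = 16000
win_len = 32 * fs // 1000            # window length in samples
hop = win_len // 2                   # frame shift in samples
rng = np.random.default_rng(0)
x = rng.standard_normal(fs)          # one second of noise standing in for speech

spectra = stft(x, np.hamming(win_len), hop)
mag_stimulus = overlap_add(magnitude_only(spectra, rng), win_len, hop)
phase_stimulus = overlap_add(phase_only(spectra), win_len, hop)
```

Swapping `np.hamming(win_len)` for `np.ones(win_len)` gives a rectangular analysis window, and changing `hop` changes the frame shift, so the same sketch covers other parameter settings.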
The authors of this paper initially set out to reproduce Liu’s results and, in doing so, made a number of modifications to Liu’s analysis–modification–synthesis procedure (see Fig. 2). The modifications produce results which are different from Liu’s results and more interesting from the viewpoint of ASR applications. The first suggested modification is that of the analysis window type. Liu and his collaborators employed a Hamming window for construction of both the magnitude-only and phase-only stimuli. In our experiments, we find that the intelligibility of phase-only stimuli is improved significantly and becomes comparable to that of magnitude-only stimuli when a rectangular window is used. The second suggested modification is the choice of analysis frame shift; Liu et al. used a frame shift of Tw/2. As shown by Allen and Rabiner (1977), in order to avoid aliasing errors during reconstruction, the STFT sampling period (or frame shift) must be at most Tw/4 for a Hamming window. In this paper, to be on the safer side, we use a frame shift of Tw/8. Our study also differs from Liu’s study with respect to the number of consonants used (16 for this study compared to 6 for Liu et al.). The design parameters are discussed in further detail later in this paper. Our results indicate that even for small window durations (Tw < 128 ms), the phase spectrum can contribute to speech intelligibility as much as the magnitude spectrum if the analysis–modification–synthesis parameters are properly selected.
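To make the two frame-shift settings concrete, the sketch below converts a window duration and a shift fraction into sample counts and checks how many frames overlap each sample for a rectangular window. The helper names and the 16 kHz sampling rate are assumptions for illustration only, and the check covers only the simple overlap count, not the aliasing argument of Allen and Rabiner (1977).

```python
import numpy as np

def frame_params(tw_ms, shift_divisor, fs=16000):
    """Return (window length, frame shift) in samples for a window of tw_ms ms.

    shift_divisor=2 gives a frame shift of Tw/2, shift_divisor=8 gives Tw/8.
    The sampling rate fs is an assumed value.
    """
    win_len = tw_ms * fs // 1000
    return win_len, win_len // shift_divisor

def window_overlap_sum(win, hop, n_frames):
    """Sum n_frames copies of win, each shifted by hop samples from the last."""
    total = np.zeros(hop * (n_frames - 1) + len(win))
    for i in range(n_frames):
        total[i * hop:i * hop + len(win)] += win
    return total

# Frame shift Tw/2 (as used by Liu et al.) versus Tw/8 (as used in this paper).
liu_len, liu_hop = frame_params(32, 2)
our_len, our_hop = frame_params(32, 8)

# With a rectangular window and a shift of Tw/8, every sample away from the
# signal edges is covered by exactly 8 frames.
n_frames = 32
total = window_overlap_sum(np.ones(our_len), our_hop, n_frames)
interior = total[our_len - our_hop:n_frames * our_hop]
```

Under these assumptions a 32 ms window is 512 samples long, with frame shifts of 256 samples at Tw/2 and 64 samples at Tw/8; the constant overlap count means a rectangular-window overlap-add output can be rescaled by a single factor.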
The paper outline is as follows: In Section 2, we detail the analysis–modification–synthesis technique used to create the phase-only and magnitude-only stimuli. In Section 3, we describe a number of experiments which evaluate the importance of short-time phase spectra and short-time magnitude spectra in human speech perception. In the first experiment, we demonstrate that intelligibility of phase-only stimuli is improved significantly when a rectangular window is used, and it becomes comparable with that of magnitude-only stimuli even for small window durations. In Experiment 2, we construct magnitude-only and phase-only stimuli for window sizes ranging from 16 ms to 2048 ms, using both Liu’s parameter settings and our parameter settings (discussed in Experiment 1) in order to compare their intelligibility. In the third experiment, we ascertain the contribution that each analysis–modification–synthesis parameter provides towards the intelligibility of signals reconstructed from phase spectra. In the aforementioned experiments, magnitude-only stimuli are created by randomising each frame’s phase spectra. It is also possible to create magnitude-only stimuli by setting all phase values for each frame to zero. Thus, in Experiment 4, we address the issue of using random-phase or zero-phase and determine if a significant difference exists between magnitude-only stimuli constructed with one or the other.
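The two ways of building a magnitude-only frame mentioned for Experiment 4 can be sketched as follows. This is a hedged single-frame illustration with hypothetical function names, not the authors' implementation; the random frame stands in for a windowed speech segment.

```python
import numpy as np

def zero_phase_frame(frame):
    """Resynthesise a frame from its magnitude spectrum with all phases set to zero."""
    mag = np.abs(np.fft.rfft(frame))
    return np.fft.irfft(mag, n=len(frame))

def random_phase_frame(frame, rng):
    """Resynthesise a frame from its magnitude spectrum with random phases."""
    mag = np.abs(np.fft.rfft(frame))
    phase = rng.uniform(-np.pi, np.pi, size=mag.shape)
    phase[0] = 0.0                 # the DC bin of a real signal must be real
    if len(frame) % 2 == 0:
        phase[-1] = 0.0            # so must the Nyquist bin for even lengths
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(frame))

rng = np.random.default_rng(0)
frame = rng.standard_normal(512)   # stand-in for one windowed speech frame
zero_version = zero_phase_frame(frame)
random_version = random_phase_frame(frame, rng)
```

Both outputs share the magnitude spectrum of the input frame. The zero-phase version is circularly symmetric in time, which is one reason the two variants may sound different even though their magnitude spectra match.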
STFT analysis–modification–synthesis technique
Although speech is a non-stationary signal, it is generally assumed to be quasi-stationary and, therefore, can be processed through a short-time Fourier analysis (Allen, 1977; Allen and Rabiner, 1977; Crochiere, 1980; Flanagan and Golden, 1966; Griffin and Lim, 1984; Mathes and Miller, 1947; Portnoff, 1976; Portnoff, 1979; Portnoff, 1980; Portnoff, 1981a; Portnoff, 1981b; Quatieri, 2002; Rabiner and Schafer, 1978; Schafer and Rabiner, 1973). Note that the modifier ‘short-time’ implies a
Experiment 1
In this experiment we compare the intelligibility of magnitude-only and phase-only stimuli using two window types: (1) a rectangular window, and (2) a Hamming window. This comparison is done at a small window duration of 32 ms as well as a large window duration of 1024 ms.
Conclusion
In this paper, the relative importance of short-time magnitude and phase spectra on speech perception is investigated. Human perception experiments are conducted to measure intelligibility of speech stimuli reconstructed either from magnitude spectra or phase spectra. The experiments reported here demonstrate that even for small window durations, phase spectra can contribute to speech intelligibility as much as magnitude spectra if the analysis–modification–synthesis parameters are properly
Acknowledgments
This work was partly supported by ARC (Discovery) grant (No. DP0209283). The authors also wish to thank the volunteers who took part in the subjective listening tests reported in this paper.
References
- et al. Effects of phase on the perception of intervocalic stop consonants. Speech Comm. (1997)
- Short-term spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Trans. Acoust. Speech Signal Process. (1977)
- et al. A unified approach to short-time Fourier analysis and synthesis. Proc. IEEE (1977)
- Alsteris, L.D., Paliwal, K.K., 2004. Importance of window shape for phase-only reconstruction of speech. In: Proc. IEEE...
- Cox, R.C., Robinson, D.M., 1980. Some notes on phase in speech signals. In: Proc. IEEE Internat. Conf. Acoust., Speech,...
- A weighted overlap-add method of short-time Fourier analysis/synthesis. IEEE Trans. Acoust., Speech Signal Process. (1980)
- et al. Effects of additive noise on signal reconstruction from Fourier transform phase. IEEE Trans. Acoust., Speech Signal Process. (1983)
- et al. Phase vocoder. Bell Syst. Tech. (1966)
- Auditory spectral filtering and monoaural phase perception. J. Acoust. Soc. Amer. (1967)
- et al. Signal estimation from modified short-time Fourier transform. IEEE Trans. Acoust., Speech Signal Process. (1984)
- Signal reconstruction from phase or magnitude. IEEE Trans. Acoust., Speech Signal Process.
- Some results on the time–frequency sampling of the short-time Fourier transform magnitude. IEEE Trans. Acoust., Speech Signal Process.
- Enhancement and bandwidth compression of noisy speech. Proc. IEEE
- Phase effects in monoaural perception. J. Acoust. Soc. Amer.
- Reconstruction of signals from phase: efficient algorithms, segmentation, and generalisations. IEEE Trans. Acoust., Speech Signal Process.
- Signal reconstruction from short-time Fourier transform magnitude. IEEE Trans. Acoust., Speech Signal Process.
- Uber die Definition des Tones, nebst daran geknupfter Theorie der Sirene und ahnlicher tonbildender Vorrichtungen. Ann. Phys. Chem.
- The importance of phase in signals. Proc. IEEE
- Digital Signal Process.
☆ Audio files at http://maxwell.me.gu.edu.au/spl/research/phase/project.htm.