Elsevier

Speech Communication

Volume 45, Issue 2, February 2005, Pages 153-170

On the usefulness of STFT phase spectrum in human listening tests

https://doi.org/10.1016/j.specom.2004.08.001

Abstract

The short-time Fourier transform (STFT) of a speech signal has two components: the magnitude spectrum and the phase spectrum. In this paper, the relative importance of short-time magnitude and phase spectra for speech perception is investigated. Human perception experiments are conducted to measure intelligibility of speech stimuli synthesized either from magnitude spectra or phase spectra. It is traditionally believed that the magnitude spectrum plays a dominant role for small window durations (20–40 ms), while the phase spectrum is more important for large window durations (>1 s). It is shown in this paper that even for small window durations, the phase spectrum can contribute to speech intelligibility as much as the magnitude spectrum if the analysis–modification–synthesis parameters are properly selected.

Introduction

In this paper, the usefulness of the phase spectrum in human speech perception is explored. The authors have a long-term goal of utilising phase spectra in an effort to improve automatic speech recognition (ASR) performance. It is common practice in ASR to discard the phase spectrum in favour of features that are derived purely from the magnitude spectrum (Picone, 1993). In the ASR framework, speech is processed frame-wise using a temporal window of duration 20–40 ms. If the phase spectrum is to be of any use for ASR applications, it should provide some information about speech intelligibility using small window durations (20–40 ms) in a human perception experiment.

A few studies have been reported in the literature which discuss whether the phase spectrum provides any information which can contribute to intelligibility for human speech recognition (HSR). Schroeder (1975), and Oppenheim and Lim (1981) performed some informal perception experiments, concluding that the phase spectrum is important for intelligibility when the window duration of the short-time Fourier transform (STFT) is large (Tw > 1 s), while it seems to convey negligible intelligibility at small window durations (20–40 ms).

Liu et al. (1997) investigated the intelligibility of phase spectra through a more formal human speech perception study. They recorded six stop-consonants from 10 speakers in vowel–consonant–vowel context. Using these recordings, they created magnitude-only and phase-only stimuli. Magnitude-only stimuli were created by analysing the original recordings with an STFT, replacing each frame’s phase spectrum with random phase values, then reconstructing the speech signal using the overlap-add method. In the case of phase-only stimuli, the original phase of each frame was retained, while the magnitude of each frame was set to unity for all frequency components. The stimuli were created for various window lengths from 16 ms to 512 ms. These were played to subjects, whose task was to identify each as one of the six consonants. Their results (Fig. 1) show that intelligibility of magnitude-only stimuli decreases while the intelligibility of the phase-only stimuli increases as the window duration increases. For small window durations (Tw < 128 ms), magnitude-only stimuli are significantly more intelligible than phase-only stimuli (while the opposite is true for larger window lengths). This implies that for small window durations (which are of relevance for ASR applications), the magnitude spectrum contributes much more towards intelligibility than the phase spectrum.
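The stimulus construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the procedure used by Liu et al. or by the authors: the function name `stft_modify`, the 16 kHz sampling rate, the synthetic test tone, and the fixed random seed are all assumptions made for the example. The window (Hamming) and frame shift (Tw/2) follow the settings the text attributes to Liu et al.

```python
import numpy as np

def stft_modify(x, win_len, hop, window, mode, rng=None):
    """Analysis-modification-synthesis with plain overlap-add (illustrative).

    mode="magnitude_only": keep |X|, replace the phase with random values.
    mode="phase_only":     keep the phase, set |X| to 1 at every frequency.
    """
    if mode not in ("magnitude_only", "phase_only"):
        raise ValueError(f"unknown mode: {mode}")
    rng = np.random.default_rng(0) if rng is None else rng
    # Zero-pad so that the last frame fits entirely inside the buffer.
    xp = np.concatenate([x, np.zeros(win_len)])
    y = np.zeros(len(xp))
    for start in range(0, len(x), hop):
        frame = xp[start:start + win_len] * window
        spec = np.fft.rfft(frame)
        if mode == "magnitude_only":
            # Note: irfft ignores the imaginary part of the DC and
            # Nyquist bins, so only interior bins get a fully random phase.
            phase = rng.uniform(-np.pi, np.pi, size=spec.shape)
            spec = np.abs(spec) * np.exp(1j * phase)
        else:
            spec = np.exp(1j * np.angle(spec))
        # Overlap-add the modified frame back into the output signal.
        y[start:start + win_len] += np.fft.irfft(spec, n=win_len)
    return y[:len(x)]

# Assumed example settings: 16 kHz sampling, Tw = 32 ms, frame shift Tw/2.
fs = 16000
win_len = 512           # 32 ms at 16 kHz
hop = win_len // 2      # Tw/2
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220 * t)   # synthetic stand-in for a speech recording

y_mag = stft_modify(x, win_len, hop, np.hamming(win_len), "magnitude_only")
y_phase = stft_modify(x, win_len, hop, np.hamming(win_len), "phase_only")
```

The sketch omits any gain normalisation of the overlap-added output, which a listening test would need so that stimuli are presented at comparable levels.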

The authors of this paper initially set out to reproduce Liu’s results and, in doing so, made a number of modifications to Liu’s analysis–modification–synthesis procedure (see Fig. 2). The modifications produce results which differ from Liu’s results and are more interesting from an ASR application’s viewpoint. The first suggested modification is that of the analysis window type. Liu and his collaborators employed a Hamming window for construction of both the magnitude-only and phase-only stimuli. In our experiments, we find that the intelligibility of phase-only stimuli is improved significantly and becomes comparable to that of magnitude-only stimuli when a rectangular window is used. The second suggested modification is the choice of analysis frame shift; Liu et al. used a frame shift of Tw/2. As shown by Allen and Rabiner (1977), in order to avoid aliasing errors during reconstruction, the STFT sampling period (or frame shift) must be at most Tw/4 for a Hamming window. In this paper, to be on the safer side, we use a frame shift of Tw/8. Our study also differs from Liu’s study with respect to the number of consonants used (16 for this study compared to 6 for Liu et al.). The design parameters are discussed in further detail later in this paper. Our results indicate that even for small window durations (Tw < 128 ms), the phase spectrum can contribute to speech intelligibility as much as the magnitude spectrum if the analysis–modification–synthesis parameters are properly selected.
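To make the two parameter choices concrete, the following sketch converts a window duration and a frame-shift fraction into sample counts and builds the corresponding analysis window. The helper name `analysis_params` and the 16 kHz sampling rate are assumptions for illustration only; the source does not state a sampling rate at this point.

```python
import numpy as np

def analysis_params(tw_ms, fs_hz, shift_divisor, window_type):
    """Return (window length, frame shift, window) in samples (illustrative)."""
    win_len = int(round(tw_ms * 1e-3 * fs_hz))
    hop = win_len // shift_divisor          # frame shift of Tw / shift_divisor
    if window_type == "rectangular":
        window = np.ones(win_len)
    elif window_type == "hamming":
        window = np.hamming(win_len)
    else:
        raise ValueError(f"unknown window type: {window_type}")
    return win_len, hop, window

fs = 16000  # assumed sampling rate

# Settings the text attributes to Liu et al.: Hamming window, shift Tw/2.
liu_len, liu_hop, liu_win = analysis_params(32, fs, 2, "hamming")
# Settings suggested in this paper: rectangular window, shift Tw/8.
new_len, new_hop, new_win = analysis_params(32, fs, 8, "rectangular")

# At 16 kHz, a 32 ms window is 512 samples; the shifts are
# 256 samples (16 ms) for Tw/2 and 64 samples (4 ms) for Tw/8.
print(liu_len, liu_hop, new_len, new_hop)
```

Under these assumed settings, moving from Tw/2 to Tw/8 quadruples the number of frames per second, which is the cost of the more conservative STFT sampling.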

The paper outline is as follows: In Section 2, we detail the analysis–modification–synthesis technique used to create the phase-only and magnitude-only stimuli. In Section 3, we describe a number of experiments which evaluate the importance of short-time phase spectra and short-time magnitude spectra in human speech perception. In the first experiment, we demonstrate that intelligibility of phase-only stimuli is improved significantly when a rectangular window is used, and it becomes comparable with that of magnitude-only stimuli even for small window durations. In Experiment 2, we construct magnitude-only and phase-only stimuli for window sizes ranging from 16 ms to 2048 ms, using both Liu’s parameter settings and our parameter settings (discussed in Experiment 1) in order to compare their intelligibility. In the third experiment, we ascertain the contribution that each analysis–modification–synthesis parameter provides towards the intelligibility of signals reconstructed from phase spectra. In the aforementioned experiments, magnitude-only stimuli are created by randomising each frame’s phase spectra. It is also possible to create magnitude-only stimuli by setting all phase values for each frame to zero. Thus, in Experiment 4, we address the issue of using random-phase or zero-phase and determine if a significant difference exists between magnitude-only stimuli constructed with one or the other.
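The distinction addressed in Experiment 4 can be illustrated on a single frame. The sketch below is an assumption-laden example, not the authors' code: the function name `magnitude_only_frame`, the frame length, and the random test frame are hypothetical. It builds a magnitude-only frame either with all phase values set to zero or with random phase values.

```python
import numpy as np

def magnitude_only_frame(frame, phase="random", rng=None):
    """Rebuild one frame from its magnitude spectrum alone (illustrative)."""
    mag = np.abs(np.fft.rfft(frame))
    if phase == "zero":
        new_phase = np.zeros_like(mag)
    elif phase == "random":
        rng = np.random.default_rng(0) if rng is None else rng
        new_phase = rng.uniform(-np.pi, np.pi, size=mag.shape)
    else:
        raise ValueError(f"unknown phase option: {phase}")
    return np.fft.irfft(mag * np.exp(1j * new_phase), n=len(frame))

# Hypothetical 512-sample frame standing in for windowed speech.
frame = np.random.default_rng(1).standard_normal(512)
zero_frame = magnitude_only_frame(frame, "zero")
rand_frame = magnitude_only_frame(frame, "random")

# Both frames keep the original magnitudes on the interior bins; the
# zero-phase frame is additionally symmetric about sample 0 (circularly).
orig_mag = np.abs(np.fft.rfft(frame))
zero_mag = np.abs(np.fft.rfft(zero_frame))
rand_mag = np.abs(np.fft.rfft(rand_frame))
```

A zero-phase frame concentrates its energy around the frame origin, whereas a random-phase frame spreads it across the frame, which is one reason the two constructions might sound different.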

Section snippets

STFT analysis–modification–synthesis technique

Although speech is a non-stationary signal, it is generally assumed to be quasi-stationary and, therefore, can be processed through a short-time Fourier analysis (Allen, 1977; Allen and Rabiner, 1977; Crochiere, 1980; Flanagan and Golden, 1966; Griffin and Lim, 1984; Mathes and Miller, 1947; Portnoff, 1976, 1979, 1980, 1981a, 1981b; Quatieri, 2002; Rabiner and Schafer, 1978; Schafer and Rabiner, 1973). Note that the modifier ‘short-time’ implies a

Experiment 1

In this experiment we compare the intelligibility of magnitude-only and phase-only stimuli using two window types: (1) a rectangular window, and (2) a Hamming window. This comparison is done at a small window duration of 32 ms as well as a large window duration of 1024 ms.

Conclusion

In this paper, the relative importance of short-time magnitude and phase spectra on speech perception is investigated. Human perception experiments are conducted to measure intelligibility of speech stimuli reconstructed either from magnitude spectra or phase spectra. The experiments reported here demonstrate that even for small window durations, phase spectra can contribute to speech intelligibility as much as magnitude spectra if the analysis–modification–synthesis parameters are properly

Acknowledgments

This work was partly supported by ARC (Discovery) grant (No. DP0209283). The authors also wish to thank the volunteers who took part in the subjective listening tests reported in this paper.

References (46)

  • L. Liu et al. Effects of phase on the perception of intervocalic stop consonants. Speech Comm. (1997)
  • J.B. Allen. Short-term spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Trans. Acoust., Speech Signal Process. (1977)
  • J.B. Allen et al. A unified approach to short-time Fourier analysis and synthesis. Proc. IEEE (1977)
  • Alsteris, L.D., Paliwal, K.K., 2004. Importance of window shape for phase-only reconstruction of speech. In: Proc. IEEE...
  • Cox, R.C., Robinson, D.M., 1980. Some notes on phase in speech signals. In: Proc. IEEE Internat. Conf. Acoust., Speech,...
  • R.E. Crochiere. A weighted overlap-add method of short-time Fourier analysis/synthesis. IEEE Trans. Acoust., Speech Signal Process. (1980)
  • C.Y. Espy et al. Effects of additive noise on signal reconstruction from Fourier transform phase. IEEE Trans. Acoust., Speech Signal Process. (1983)
  • J.L. Flanagan et al. Phase vocoder. Bell Syst. Tech. (1966)
  • J.L. Goldstein. Auditory spectral filtering and monoaural phase perception. J. Acoust. Soc. Amer. (1967)
  • D.W. Griffin et al. Signal estimation from modified short-time Fourier transform. IEEE Trans. Acoust., Speech Signal Process. (1984)
  • M.H. Hayes et al. Signal reconstruction from phase or magnitude. IEEE Trans. Acoust., Speech Signal Process. (1980)
  • D. Izraelevitz. Some results on the time–frequency sampling of the short-time Fourier transform magnitude. IEEE Trans. Acoust., Speech Signal Process. (1985)
  • Kim, D.S., 2000. Perceptual phase redundancy in speech. In: Proc. IEEE Internat. Conf. Acoust., Speech, Signal...
  • J.S. Lim et al. Enhancement and bandwidth compression of noisy speech. Proc. IEEE (1979)
  • R.C. Mathes et al. Phase effects in monoaural perception. J. Acoust. Soc. Amer. (1947)
  • G.A. Merchant et al. Reconstruction of signals from phase: efficient algorithms, segmentation, and generalisations. IEEE Trans. Acoust., Speech Signal Process. (1983)
  • S.H. Nawab et al. Signal reconstruction from short-time Fourier transform magnitude. IEEE Trans. Acoust., Speech Signal Process. (1983)
  • G.S. Ohm. Über die Definition des Tones, nebst daran geknüpfter Theorie der Sirene und ähnlicher tonbildender Vorrichtungen. Ann. Phys. Chem. (1843)
  • A.V. Oppenheim et al. The importance of phase in signals. Proc. IEEE (1981)
  • A.V. Oppenheim et al. Digital Signal Process. (1975)
  • Paliwal, K.K., 2003. Usefulness of phase in speech processing. Proc. IPSJ Spoken Language Process. Workshop, Gifu,...
  • Paliwal, K.K., Alsteris, L., 2003. Usefulness of phase spectrum in human speech perception. In: Proc. Eurospeech,...
  • Paliwal, K.K., Atal, B.S., 2003. Frequency-related representation of speech. In: Proc. Eurospeech, Geneva, Switzerland,...