On the usefulness of STFT phase spectrum in human listening tests

doi:10.1016/j.specom.2004.08.001

Speech Communication

Volume 45, Issue 2, February 2005, Pages 153-170

https://doi.org/10.1016/j.specom.2004.08.001

Abstract

The short-time Fourier transform (STFT) of a speech signal has two components: the magnitude spectrum and the phase spectrum. In this paper, the relative importance of short-time magnitude and phase spectra for speech perception is investigated. Human perception experiments are conducted to measure the intelligibility of speech stimuli synthesized from either magnitude spectra or phase spectra. It is traditionally believed that the magnitude spectrum plays a dominant role for small window durations (20–40 ms), while the phase spectrum is more important for large window durations (>1 s). It is shown in this paper that, even for small window durations, the phase spectrum can contribute to speech intelligibility as much as the magnitude spectrum if the analysis–modification–synthesis parameters are properly selected.

Introduction

In this paper, the usefulness of the phase spectrum¹ is explored in human speech perception.² The authors have a long-term goal of utilising phase spectra in an effort to improve automatic speech recognition (ASR) performance. It is common practice in ASR to discard the phase spectrum in favour of features that are derived purely from the magnitude spectrum³ (Picone, 1993). In the ASR framework, speech is processed frame-wise using a temporal window of duration 20–40 ms. If the phase spectrum is to be of any use for ASR applications, it should provide some information about speech intelligibility using small window durations (20–40 ms) in a human perception experiment.

A few studies have been reported in the literature which discuss whether the phase spectrum provides any information which can contribute to intelligibility for human speech recognition (HSR). Schroeder (1975), and Oppenheim and Lim (1981) performed some informal perception experiments, concluding that the phase spectrum is important for intelligibility when the window duration of the short-time Fourier transform (STFT) is large (T_w > 1 s), while it seems to convey negligible intelligibility at small window durations (20–40 ms).

Liu et al. (1997) investigated the intelligibility of phase spectra through a more formal human speech perception study. They recorded six stop consonants from 10 speakers in vowel–consonant–vowel context. Using these recordings, they created magnitude-only and phase-only stimuli. Magnitude-only stimuli were created by analysing the original recordings with an STFT, replacing each frame’s phase spectrum with random phase values, then reconstructing the speech signal using the overlap-add method. In the case of phase-only stimuli, the original phase of each frame was retained, while the magnitude of each frame was set to unity for all frequency components. The stimuli were created for various window lengths from 16 ms to 512 ms. These were played to subjects, whose task was to identify each as one of the six consonants. Their results (Fig. 1) show that the intelligibility of magnitude-only stimuli decreases, while the intelligibility of phase-only stimuli increases, as the window duration increases. For small window durations (T_w < 128 ms), magnitude-only stimuli are significantly more intelligible than phase-only stimuli, while the opposite is true for larger window durations. This implies that for small window durations (which are of relevance for ASR applications), the magnitude spectrum contributes much more towards intelligibility than the phase spectrum.
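The stimulus construction described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the procedure used by Liu et al. or by the authors: the function name `make_stimulus`, the zero-padding at the signal edges, the uniform random phase in [-π, π), and the normalisation by the summed window are all choices made here for the sake of a runnable example.

```python
import numpy as np


def make_stimulus(x, frame_len, hop, window="hamming", mode="magnitude_only", seed=0):
    """Hypothetical helper: STFT analysis, spectral modification, overlap-add synthesis."""
    rng = np.random.default_rng(seed)

    # Analysis window: Hamming or rectangular.
    if window == "hamming":
        win = np.hamming(frame_len)
    elif window == "rectangular":
        win = np.ones(frame_len)
    else:
        raise ValueError(f"unknown window: {window}")

    # Assumption: zero-pad both ends so every input sample lies inside full frames.
    x = np.asarray(x, dtype=float)
    xp = np.concatenate([np.zeros(frame_len), x, np.zeros(frame_len)])
    y = np.zeros_like(xp)
    win_sum = np.zeros_like(xp)

    for start in range(0, len(xp) - frame_len + 1, hop):
        spec = np.fft.rfft(xp[start:start + frame_len] * win)
        mag, phase = np.abs(spec), np.angle(spec)

        if mode == "magnitude_only":
            # Keep the magnitude, replace the phase with random values.
            phase = rng.uniform(-np.pi, np.pi, size=phase.shape)
        elif mode == "phase_only":
            # Keep the phase, set the magnitude to unity at all frequencies.
            mag = np.ones_like(mag)
        elif mode != "unmodified":
            raise ValueError(f"unknown mode: {mode}")

        # Inverse transform of the modified spectrum, then overlap-add.
        frame = np.fft.irfft(mag * np.exp(1j * phase), n=frame_len)
        y[start:start + frame_len] += frame
        win_sum[start:start + frame_len] += win

    # Assumption: divide by the summed window so an unmodified STFT returns the input.
    y /= np.maximum(win_sum, 1e-8)
    return y[frame_len:frame_len + len(x)]
```

Calling `make_stimulus(x, frame_len, hop, mode="phase_only")` and `make_stimulus(x, frame_len, hop, mode="magnitude_only")` on the same recording gives the two stimulus types; the extra `"unmodified"` mode exists only as a sanity check that analysis followed by synthesis reproduces the input.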

The authors of this paper initially set out to reproduce Liu’s results and, in doing so, made a number of modifications to Liu’s analysis–modification–synthesis procedure (see Fig. 2). The modifications produce results which differ from Liu’s and are more interesting from the viewpoint of ASR applications. The first suggested modification concerns the analysis window type. Liu and his collaborators employed a Hamming window for construction of both the magnitude-only and phase-only stimuli. In our experiments, we find that the intelligibility of phase-only stimuli improves significantly, and becomes comparable to that of magnitude-only stimuli, when a rectangular window is used. The second suggested modification concerns the analysis frame shift; Liu et al. used a frame shift of T_w/2. As shown by Allen and Rabiner (1977), in order to avoid aliasing errors during reconstruction, the STFT sampling period (or frame shift) must be at most T_w/4 for a Hamming window. In this paper, to be on the safe side, we use a frame shift of T_w/8. Our study also differs from Liu’s with respect to the number of consonants used (16 in this study compared to 6 for Liu et al.). The design parameters are discussed in further detail later in this paper. Our results indicate that even for small window durations (T_w < 128 ms), the phase spectrum can contribute to speech intelligibility as much as the magnitude spectrum if the analysis–modification–synthesis parameters are properly selected.⁴
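To make the two frame-shift settings concrete, the short sketch below converts a window duration into a frame length and frame shift in samples. The helper `frame_params` and the 16 kHz sampling rate are assumptions introduced for illustration only; the sampling rate used in the experiments is not stated in this section.

```python
def frame_params(tw_ms, fs_hz, shift_divisor):
    """Hypothetical helper: frame length, frame shift (samples) and overlap fraction."""
    frame_len = round(tw_ms * fs_hz / 1000)   # window duration T_w in samples
    hop = frame_len // shift_divisor          # frame shift of T_w / shift_divisor
    overlap = 1 - hop / frame_len             # fraction of each frame shared with the next
    return frame_len, hop, overlap


FS_HZ = 16000  # assumed sampling rate, not taken from the paper

for tw_ms in (32, 1024):
    for divisor, label in ((2, "T_w/2 (Liu et al.)"), (8, "T_w/8 (this paper)")):
        frame_len, hop, overlap = frame_params(tw_ms, FS_HZ, divisor)
        print(f"T_w = {tw_ms} ms, shift {label}: "
              f"{frame_len} samples per frame, hop {hop}, overlap {overlap:.1%}")
```

Under these assumptions a 32 ms window spans 512 samples, so a shift of T_w/2 advances 256 samples per frame (50% overlap), while a shift of T_w/8 advances 64 samples (87.5% overlap) and produces four times as many frames.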

The paper outline is as follows: In Section 2, we detail the analysis–modification–synthesis technique used to create the phase-only and magnitude-only stimuli. In Section 3, we describe a number of experiments which evaluate the importance of short-time phase spectra and short-time magnitude spectra in human speech perception. In the first experiment, we demonstrate that intelligibility of phase-only stimuli is improved significantly when a rectangular window is used, and it becomes comparable with that of magnitude-only stimuli even for small window durations. In Experiment 2, we construct magnitude-only and phase-only stimuli for window sizes ranging from 16 ms to 2048 ms, using both Liu’s parameter settings and our parameter settings (discussed in Experiment 1) in order to compare their intelligibility. In the third experiment, we ascertain the contribution that each analysis–modification–synthesis parameter provides towards the intelligibility of signals reconstructed from phase spectra. In the aforementioned experiments, magnitude-only stimuli are created by randomising each frame’s phase spectra. It is also possible to create magnitude-only stimuli by setting all phase values for each frame to zero. Thus, in Experiment 4, we address the issue of using random-phase or zero-phase and determine if a significant difference exists between magnitude-only stimuli constructed with one or the other.

Section snippets

STFT analysis–modification–synthesis technique

Although speech is a non-stationary signal, it is generally assumed to be quasi-stationary and, therefore, can be processed through a short-time Fourier analysis (Allen, 1977; Allen and Rabiner, 1977; Crochiere, 1980; Flanagan and Golden, 1966; Griffin and Lim, 1984; Mathes and Miller, 1947; Portnoff, 1976, 1979, 1980, 1981a, 1981b; Quatieri, 2002; Rabiner and Schafer, 1978; Schafer and Rabiner, 1973). Note that the modifier ‘short-time’ implies a

Experiment 1

In this experiment we compare the intelligibility of magnitude-only and phase-only stimuli using two window types: (1) a rectangular window, and (2) a Hamming window.⁷ This comparison is done at a small window duration of 32 ms as well as a large window duration of 1024 ms.

Conclusion

In this paper, the relative importance of short-time magnitude and phase spectra on speech perception is investigated. Human perception experiments are conducted to measure intelligibility of speech stimuli reconstructed either from magnitude spectra or phase spectra. The experiments reported here demonstrate that even for small window durations, phase spectra can contribute to speech intelligibility as much as magnitude spectra if the analysis–modification–synthesis parameters are properly

Acknowledgments

This work was partly supported by ARC (Discovery) grant (No. DP0209283). The authors also wish to thank the volunteers who took part in the subjective listening tests reported in this paper.

References (46)

Liu, L., et al., 1997. Effects of phase on the perception of intervocalic stop consonants. Speech Comm.
Allen, J.B., 1977. Short-term spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Trans. Acoust. Speech Signal Process.
Allen, J.B., et al., 1977. A unified approach to short-time Fourier analysis and synthesis. Proc. IEEE
Alsteris, L.D., Paliwal, K.K., 2004. Importance of window shape for phase-only reconstruction of speech. In: Proc. IEEE...
Cox, R.C., Robinson, D.M., 1980. Some notes on phase in speech signals. In: Proc. IEEE Internat. Conf. Acoust., Speech,...
Crochiere, R.E., 1980. A weighted overlap-add method of short-time Fourier analysis/synthesis. IEEE Trans. Acoust. Speech Signal Process.
Espy, C.Y., et al., 1983. Effects of additive noise on signal reconstruction from Fourier transform phase. IEEE Trans. Acoust. Speech Signal Process.
Flanagan, J.L., et al., 1966. Phase vocoder. Bell Syst. Tech.
Goldstein, J.L., 1967. Auditory spectral filtering and monoaural phase perception. J. Acoust. Soc. Amer.
Griffin, D.W., et al., 1984. Signal estimation from modified short-time Fourier transform. IEEE Trans. Acoust. Speech Signal Process.
Hayes, M.H., et al., 1980. Signal reconstruction from phase or magnitude. IEEE Trans. Acoust. Speech Signal Process.
Izraelevitz, D., 1985. Some results on the time–frequency sampling of the short-time Fourier transform magnitude. IEEE Trans. Acoust. Speech Signal Process.
Kim, D.S., 2000. Perceptual phase redundancy in speech. In: Proc. IEEE Internat. Conf. Acoust., Speech, Signal...
Lim, J.S., et al., 1979. Enhancement and bandwidth compression of noisy speech. Proc. IEEE
Mathes, R.C., et al., 1947. Phase effects in monoaural perception. J. Acoust. Soc. Amer.
Merchant, G.A., et al., 1983. Reconstruction of signals from phase: efficient algorithms, segmentation, and generalisations. IEEE Trans. Acoust. Speech Signal Process.
Nawab, S.H., et al., 1983. Signal reconstruction from short-time Fourier transform magnitude. IEEE Trans. Acoust. Speech Signal Process.
Ohm, G.S., 1843. Über die Definition des Tones, nebst daran geknüpfter Theorie der Sirene und ähnlicher tonbildender Vorrichtungen. Ann. Phys. Chem.
Oppenheim, A.V., et al., 1981. The importance of phase in signals. Proc. IEEE
Oppenheim, A.V., et al., 1975. Digital Signal Process.
Paliwal, K.K., 2003. Usefulness of phase in speech processing. Proc. IPSJ Spoken Language Process. Workshop, Gifu,...
Paliwal, K.K., Alsteris, L., 2003. Usefulness of phase spectrum in human speech perception. In: Proc. Eurospeech,...
Paliwal, K.K., Atal, B.S., 2003. Frequency-related representation of speech. In: Proc. Eurospeech, Geneva, Switzerland,...


^☆: Audio files at http://maxwell.me.gu.edu.au/spl/research/phase/project.htm.
