Elsevier

Speech Communication

Volume 81, July 2016, Pages 104-119

Phase perception of the glottal excitation and its relevance in statistical parametric speech synthesis

https://doi.org/10.1016/j.specom.2016.01.007

Highlights

  • Phase perception of the glottal excitation is studied.

  • A source-filter vocoder is used to modify the pitch-synchronous excitation phase pattern.

  • Natural-phase, zero-phase, and random-phase excitations are compared.

  • Various speakers and speaking styles are utilized in subjective listening tests.

  • Results show that using natural phase information improves speech quality.

Abstract

While the characteristics of the amplitude spectrum of the voiced excitation have been studied widely in both natural and synthetic speech, the role of the excitation phase has remained less explored. This neglect is at odds with findings from sound perception studies indicating that humans are not phase deaf. In speech synthesis in particular, phase information is often omitted for simplicity. This study investigates the impact of the phase information of the excitation signal of voiced speech and its relevance in statistical parametric speech synthesis. The experiments involve, firstly, converting the pitch-synchronously computed original phase spectra of the excitation waveforms (either glottal flow waveforms or residuals) to either zero phase, cyclostationary random phase, or random phase. Secondly, the quality of synthetic speech in each case is compared in subjective listening tests to the corresponding signal excited with the original, natural phase. Experiments are conducted with natural, vocoded, and synthetic speech using voice material from various speakers with varying speaking styles, such as breathy, normal, and Lombard speech. The results indicate that the phase spectrum of the voiced excitation has a perceptually relevant effect in natural, vocoded, and synthetic speech, and that utilizing the phase information in speech synthesis leads to improved speech quality.

Introduction

In statistical parametric speech synthesis (SPSS), several vocoding techniques have been used in the past decade (Tokuda et al., 2013; Zen et al., 2009). The conventional vocoding approach employs excitation signals composed of impulses mixed with noise. The spectrum of this kind of simple excitation, in terms of both amplitude and phase, differs greatly from the spectrum of the real voice source of speech, the glottal flow. While the characteristics of the amplitude spectrum of the voice excitation have been studied widely in both natural (Childers and Lee, 1991; Gobl and Ní Chasaide, 1992) and synthetic (Klatt and Klatt, 1990; Raitio et al., 2014c) speech, the role of the excitation phase has remained less explored. This neglect is at odds with findings from sound perception studies indicating that humans are not phase deaf (Patterson, 1987). In addition, previous studies show that the phase spectrum is perceptually relevant in speech signals in particular (Pobloth and Kleijn, 1999) and that incorporating phase information is advantageous, for example, in feature extraction for speech recognition (Alsteris and Paliwal, 2004; Paliwal, 2003; Zhu and Paliwal, 2004).

The common tradition of discarding phase information in speech processing stems from two issues. Firstly, the magnitude spectrum is perceptually more relevant than the phase spectrum. Secondly, there are inherent difficulties, such as phase unwrapping (Tribolet, 1977), in processing the phase spectrum. In addition, previous studies indicate that the perception of phase depends in a complex way on the signal’s fundamental frequency (f0), intensity, and bandwidth (Laitinen et al., 2013; Patterson, 1987). Despite these complications, the present study was designed to investigate the impact of phase information in speech synthesis. Unlike previous studies, which use phase information extracted from speech pressure signals (e.g. Paliwal and Alsteris, 2005), the current investigation aims to gather new knowledge on the perceptual relevance of the phase embedded in the speech excitation that is used by the vocoder in SPSS. More specifically, this study explores how the perception of phase information depends on factors related to the speech material, such as gender, speaker, and speaking style. The experiments involve, firstly, converting the pitch-synchronously computed original phase spectra of the excitation waveforms (either glottal flow waveforms or residuals) to either zero phase, cyclostationary random phase, or random phase. Secondly, the quality of synthetic speech in each case is compared in subjective listening tests to the corresponding signal excited with the original, natural phase. Experiments are conducted with natural, vocoded, and synthetic speech using voice material from various speakers with varying speaking styles, such as breathy, normal, and Lombard speech.

The paper is organized as follows. Section 2 briefly presents the properties of a periodic signal and discusses previous studies on phase perception and its mechanisms. In addition, the relation of phase to speech production and voice quality is described, and previous studies on utilizing phase in SPSS are discussed. Section 3 first describes the methodology of phase modification, and then details three separate experiments conducted with natural, vocoded, and synthetic speech and presents their results. Section 4 discusses the implications of the results, and finally Section 5 summarizes the findings and concludes the paper.

Section snippets

Properties of a periodic signal

A steady-state periodic signal $s(t)$ can be represented by

$$s(t) = \sum_{n=1}^{\infty} a_n \sin(2\pi n f_0 t + \varphi_n) \qquad (1)$$

where $a_n$ and $\varphi_n$ are the amplitudes and the phases (in radians) of the $n$th sinusoidal component, respectively, and $f_0$ is the fundamental frequency of the signal in Hertz (oscillations or cycles per second). According to Eq. (1), the waveform of the steady state periodic signal depends solely on $a_n$, which define the peak amplitude of each sinusoidal component, and $\varphi_n$ which define the instantaneous phase of each
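As a minimal illustration of Eq. (1), the following sketch synthesizes two signals that share the same harmonic amplitudes but differ in their phases, and compares their magnitude spectra. The sampling rate, fundamental frequency, number of harmonics, 1/n amplitude roll-off, and the helper name `harmonic_signal` are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def harmonic_signal(amps, phases, f0, fs, duration):
    """Evaluate s(t) = sum_n a_n * sin(2*pi*n*f0*t + phi_n) on a sampled time axis."""
    t = np.arange(int(round(duration * fs))) / fs
    n = np.arange(1, len(amps) + 1)[:, None]  # harmonic indices 1..N as a column
    return np.sum(amps[:, None] * np.sin(2 * np.pi * n * f0 * t + phases[:, None]), axis=0)

# Assumed example values (not from the paper).
fs, f0 = 16000, 100                 # sampling rate and fundamental frequency in Hz
amps = 1.0 / np.arange(1, 21)       # 20 harmonics with an arbitrary 1/n roll-off
rng = np.random.default_rng(0)

s_zero = harmonic_signal(amps, np.zeros(20), f0, fs, 0.1)
s_rand = harmonic_signal(amps, rng.uniform(-np.pi, np.pi, 20), f0, fs, 0.1)

# The analysis window spans an integer number of periods, so each harmonic
# falls exactly on one DFT bin and the magnitudes can be compared directly.
mag_zero = np.abs(np.fft.rfft(s_zero))
mag_rand = np.abs(np.fft.rfft(s_rand))

print("same magnitude spectrum:", np.allclose(mag_zero, mag_rand, atol=1e-6))
print("same waveform:", np.allclose(s_zero, s_rand))
```

The two waveforms differ sample by sample although their magnitude spectra agree, which is the situation the listening tests probe.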

Phase manipulation

In order to investigate the effect of the excitation phase characteristics in SPSS, a phase manipulation scheme was developed in which the phase of the excitation signal can be altered while keeping the magnitude spectrum unchanged. The phase manipulation is performed pitch-synchronously, one two-period glottal-flow-derivative (or residual) waveform at a time, and the signal is then reconstructed using the pitch-synchronous overlap-add (PSOLA) method (Charpentier and Stella, 1986,
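The core idea of altering the phase while keeping the magnitude spectrum can be sketched for a single frame as follows. This is a hedged illustration rather than the authors' implementation: the function name `modify_phase`, the frame length, and the noise-based stand-in frame are assumptions, and the cyclostationary random-phase condition and the PSOLA reconstruction step are not shown.

```python
import numpy as np

def modify_phase(frame, mode, rng=None):
    """Return a frame with the magnitude spectrum of `frame` and a replaced phase spectrum."""
    spectrum = np.fft.rfft(frame)
    magnitude = np.abs(spectrum)
    if mode == "natural":
        phase = np.angle(spectrum)              # keep the original phase
    elif mode == "zero":
        phase = np.zeros_like(magnitude)        # all components start at zero phase
    elif mode == "random":
        rng = np.random.default_rng() if rng is None else rng
        phase = rng.uniform(-np.pi, np.pi, size=magnitude.shape)
        phase[0] = 0.0                          # the DC bin must stay real
        if len(frame) % 2 == 0:
            phase[-1] = 0.0                     # so must the Nyquist bin for even lengths
    else:
        raise ValueError(f"unknown mode: {mode}")
    return np.fft.irfft(magnitude * np.exp(1j * phase), n=len(frame))

# Stand-in for a windowed two-period excitation frame (assumed length of 320 samples).
rng = np.random.default_rng(0)
frame = np.hanning(320) * rng.standard_normal(320)
outputs = {mode: modify_phase(frame, mode, rng) for mode in ("natural", "zero", "random")}
```

Note that the zero-phase output is circularly symmetric around sample 0; in practice one would likely shift it (for example with `np.fft.fftshift`) so that the pulse sits at the frame centre before overlap-add.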

Discussion

The experiments showed that the phase characteristics in the glottal excitation signal are very important for high-quality speech. The voice source has a specific phase pattern emerging from the vocal fold vibratory pattern and the asymmetry between vocal fold closing and opening. There is also an aperiodic component present in the voice source that gives each voice specific perceptual characteristics. The phase modifications performed in the experiments preserved the amplitude spectrum of

Conclusions

In this study, the perception of phase of the glottal excitation of voiced speech was investigated. The experiments involved modifying the excitation phase pitch-synchronously to either zero-phase, cyclostationary-random phase, or random-phase form, and then comparing the speech quality to the natural-phase samples. Experiments were performed using natural speech, vocoded speech, and speech generated with statistical parametric speech synthesis. Subjective evaluations were performed to assess

Acknowledgements

This work was supported by the Academy of Finland (256961, 284671).

References (120)

  • Moulines, E. et al. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. J. Acoust. Soc. Am. (1990)
  • Paliwal, K.K. et al. On the usefulness of STFT phase spectrum in human listening tests. Speech Commun. (2005)
  • Raitio, T. et al. Synthesis and perception of breathy, normal, and Lombard speech in the presence of noise. Comput. Speech Lang. (2014)
  • Agiomyrgiannakis, Y. Vocaine the vocoder and applications in speech synthesis. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015)
  • Agiomyrgiannakis, Y. et al. Towards flexible speech coding for speech synthesis: an LF + modulated noise vocoder. Proceedings of the Interspeech (2008)
  • Alku, P. et al. An amplitude quotient based method to analyze changes in the shape of the glottal pulse in the regulation of vocal intensity. J. Acoust. Soc. Am. (2006)
  • Alku, P. et al. Normalized amplitude quotient for parameterization of the glottal flow. J. Acoust. Soc. Am. (2002)
  • Alsteris, L.D. et al. ASR on speech reconstructed from short-time Fourier phase spectra. Proceedings of the Interspeech (2004)
  • Bilsen, F.A. On the influence of the number and phase of harmonics on the perceptibility of the pitch of complex signals. Acustica (1973)
  • Cabral, J. et al. Towards an improved modeling of the glottal source in statistical parametric speech synthesis. Proceedings of the 6th ISCA Workshop on Speech Synthesis (SSW6) (2007)
  • Cabral, J. et al. Glottal spectral separation for parametric speech synthesis. Proceedings of the Interspeech (2008)
  • Cabral, J.P. et al. HMM-based speech synthesiser using the LF-model of the glottal source. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2011)
  • Cabral, J.P. et al. Glottal spectral separation for speech synthesis. IEEE J. Selected Topics Signal Process. (2014)
  • Carlson, R. et al. Voice source rules for text-to-speech synthesis. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (1989)
  • Carlson, R. et al. Vowel Perception: The Relative Perceptual Salience of Selected Acoustic Manipulations. Speech Transm. Lab. Quart. Progr. Status Report (STL-QPSR) (1979)
  • Charpentier, F. et al. Diphone synthesis using an overlap-add technique for speech waveforms concatenation. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (1986)
  • Childers, D. et al. Modeling the glottal volume-velocity waveform for three voice types. J. Acoust. Soc. Am. (1995)
  • Childers, D. et al. Speech synthesis by glottal excited linear prediction. J. Acoust. Soc. Am. (1994)
  • Childers, D.G. et al. Vocal quality factors: Analysis, synthesis, and perception. J. Acoust. Soc. Am. (1991)
  • de Boer, E. A note on phase distortion and hearing. Acustica (1961)
  • de Veth, J. et al. Extraction of control parameters for the voice source in a text-to-speech system. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (1990)
  • Degottex, G. et al. COVAREP – A collaborative voice analysis repository for speech technologies. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014)
  • Drugman, T. Residual excitation skewness for automatic speech polarity detection. IEEE Signal Process. Lett. (2013)
  • Drugman, T. et al. Joint robust voicing detection and pitch estimation based on residual harmonics. Proceedings of the Interspeech (2011)
  • Drugman, T. et al. The deterministic plus stochastic model of the residual signal and its applications. IEEE Trans. Audio Speech Lang. Proc. (2012)
  • Drugman, T. et al. Excitation modeling for HMM-based speech synthesis: Breaking down the impact of periodic and aperiodic components. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014)
  • Drugman, T. et al. Detection of glottal closure instants from speech signals: A quantitative review. IEEE Trans. Audio Speech Lang. Proc. (2012)
  • Drugman, T. et al. A deterministic plus stochastic model of the residual signal for improved parametric speech synthesis. Proceedings of the Interspeech (2009)
  • Drugman, T. et al. Using a pitch-synchronous residual codebook for hybrid HMM/frame selection speech synthesis. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2009)
  • Erro, D. et al. Harmonics plus noise model based vocoder for statistical parametric speech synthesis. IEEE J. Selected Topics Signal Process. (2014)
  • Fant, G. The LF-Model Revisited. Transformations and Frequency Domain Analysis. Speech Transm. Lab. Quart. Progr. Status Report (STL-QPSR) (1995)
  • Fant, G. et al. A Four-Parameter Model of Glottal Flow. Speech Transm. Lab. Quart. Progr. Status Report (STL-QPSR) (1985)
  • Fries, G. Hybrid time- and frequency-domain speech synthesis with extended glottal source generation. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (1994)
  • Goldstein, J.L. Auditory spectral filtering and monaural phase perception. J. Acoust. Soc. Am. (1967)
  • Hanson, H.M. Glottal Characteristics of Female Speakers (1995)
  • Holmes, J. The influence of glottal waveform on the naturalness of speech from a parallel formant synthesizer. IEEE Trans. Audio Electroacoust. (1973)
  • Hunt, A. et al. Unit selection in a concatenative speech synthesis system using a large speech database. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (1996)
  • International Telecommunication Union, 2011. Objective measurement of active speech level. Recommendation ITU-T P.56...
  • International Telecommunication Union, 2014. Method for the subjective assessment of intermediate quality level of...
  • Johnson, D.H. The relationship between spike rate and synchrony in responses of auditory-nerve fibers to single tones. J. Acoust. Soc. Am. (1980)

Audio files and additional figures can be found at http://research.spa.aalto.fi/publications/papers/specom-phase/.