Elsevier

Computer Speech & Language

Volume 36, March 2016, Pages 365-394

Nonlinear interactive source-filter models for speech

https://doi.org/10.1016/j.csl.2014.12.002

Highlights

  • We propose two interactive source-filter models, ISFMs, for speech production.

  • ISFMs can reproduce the fine details of the glottal flow.

  • A parameter estimation method is developed for determining the model parameters.

  • The algorithm yields ISFMs that perform better than the linear source-filter model.

Abstract

The linear source-filter model of speech production assumes that the source of the speech sounds is independent of the filter. However, acoustic simulations based on physical speech production models show that when the fundamental frequency of the source harmonics approaches the first formant of the vocal tract filter, the filter has significant effects on the source due to the nonlinear coupling between them. In this study, two interactive system models are proposed under the quasi-steady Bernoulli flow and linear vocal tract assumptions. An algorithm is developed to estimate the model parameters. The glottal flow and the linear vocal tract parameters are found by conventional methods, and the Rosenberg model is used to synthesize the glottal waveform. A recursive optimization method is proposed to find the parameters of the interactive model. Finally, the glottal flow produced by the nonlinear interactive system is computed. The experimental results show that the interactive system model reproduces the fine details of the glottal flow source accurately.

Introduction

The human speech production system involves the lungs, trachea, vocal cords, pharynx, oral tract, nasal tract, tongue and lips. The brain controls the production process, while the other organs form the time-varying aerodynamic and acoustic subsystems. Based on the physics of speech production, the aerodynamic and acoustic subsystems are the basis of current articulatory speech synthesizers, which mimic the human vocal system.

The aerodynamic part of the speech production system involves the interaction of airflow and the tissue structure of the vocal folds. To investigate the interaction effects on the vibration of the vocal folds, the nonlinear Navier–Stokes differential equations need to be solved on the complex soft-tissue structure of the vocal folds with a set of appropriate boundary conditions (Alipour et al., 2000, Jungsoo and Frankel, 2007, Jungsoo and Frankel, 2008, Zheng et al., 2009). Due to the high complexity of vocal fold tissue dynamics and the high computational cost of finite-element methods, excised canine vocal folds or kinematic solid physical models of vocal folds have been used instead of numerical simulation to investigate the aerodynamic system (Alipour and Scherer, 1995, Alipour and Scherer, 2006, Khosla et al., 2007, Khosla et al., 2008). For articulatory speech synthesis, the aerodynamic system is generally modeled by the combination of a bio-mechanical structure that models tissue elasticity and an airflow model. These models work based on the myoelastic aerodynamic theory of phonation (Van den Berg, 1958, Titze, 2006a). They can be classified as low dimensional, such as one-mass, two-mass and sixteen-mass models, and high dimensional, like point-mass models. Their inputs are lung pressure and the elasticity, mass and friction of the vocal folds, and the output is the glottal area. In addition to these input variables, the acoustic loading of the vocal tract or the supraglottal input pressure can also be used as an input to the aerodynamic system so that the acoustic–aerodynamic interaction can be taken into account (Zanartu et al., 2007).

The acoustical part of the speech production system represents the transmission and radiation of sound generated by the vocal folds or by turbulence at a constriction in the vocal tract. It is usually assumed that the sound propagation is one dimensional and linear, except at the glottis and at constrictions where nonlinear effects cannot be neglected (Sondhi and Schroeter, 1999). In articulatory synthesizers, the vocal tract and trachea are considered as a concatenation of uniform tubes with either equal or different lengths, depending on the acoustic simulation type. Since, in linear acoustics, sound propagation in a tube depends on the cross-sectional area and the length, these parameters are calculated from 3D MRI images of the vocal tract in order to simulate its acoustics. The inputs of the acoustic subsystem are therefore pressure or flow sources, cross-sectional areas and tube lengths, and its outputs are the acoustic pressure distribution in the vocal tract and the radiated speech pressure wave.
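The tube-concatenation view above can be made concrete with a short sketch. At the junction between two uniform tube sections, the fraction of the acoustic wave that is reflected depends only on the ratio of the cross-sectional areas; the area values below are hypothetical, not MRI measurements.

```python
import numpy as np

def reflection_coefficients(areas):
    """Reflection coefficients at the junctions between uniform tube sections.

    areas: cross-sectional areas ordered from glottis to lips.
    At junction k: r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k).
    """
    a = np.asarray(areas, dtype=float)
    return (a[1:] - a[:-1]) / (a[1:] + a[:-1])

# Illustrative area function in cm^2 (hypothetical values, not measured data).
areas = [0.6, 1.2, 2.5, 3.0, 1.5, 0.8]
r = reflection_coefficients(areas)
```

Each coefficient lies in (-1, 1); an expanding tube gives a positive coefficient, a narrowing one a negative coefficient. A full simulation would propagate forward and backward waves through these junctions.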

Most of the current articulatory speech synthesis methods rely on two assumptions: quasi-steady flow at the glottis and a linear vocal tract. Simulations based on these assumptions show that the glottal flow is affected by the vocal tract acoustic loading if the vocal tract impedance is comparable to the acoustic glottis impedance; in other words, if the supraglottal pressure is comparable to the subglottal pressure (Ananthapadmanabha and Fant, 1982, Titze, 2006a). Studies (Ishizaka and Flanagan, 1972, Ananthapadmanabha and Fant, 1982, Titze, 1984, Titze, 2008) showed two major effects of this interaction: a superimposed ripple on the glottal flow and a skewing of the glottal flow with respect to the glottal area. Titze (2008) calls these level-I effects.

Despite the fact that research on the physics of speech production has been continuing for a considerable amount of time, the acquired knowledge has not been utilized extensively in the development of speech signal models. The linear source-filter model has been dominant for about half a century. It is applied in many speech-related applications ranging from speech synthesis to speaker recognition. On the other hand, it is a quite simplified approximation of the speech production system: it assumes that the glottal flow is independent of the vocal tract acoustics. However, this assumption holds well only for low-pitched male speech (Titze, 2008). Simulations of physical speech production models show that the glottal flow waveform is significantly affected by the vocal tract acoustics when the fundamental frequency (F0) of the glottal flow approaches the first formant (F1) of the vocal tract (this is illustrated in Section 2). These effects may be especially important in emotional speech synthesis, because, depending on the emotional state of the speaker, the fundamental frequency of the voice source can increase (or decrease) while F1 stays fixed or even decreases. The physical simulation results suggest that if the fundamental frequency of a speech signal is to be altered, the changes in the glottal flow waveform due to the interaction between the source and the filter should be considered. This knowledge obtained from the physical simulations is not used in fundamental frequency modification methods, which might be one of the reasons why the perceived quality of modified speech decreases when using TD-PSOLA (Bulut and Narayanan, 2008). An interactive source-filter model can make it possible to use this information in speech processing applications.

The aim of this study is to develop, based on the existing knowledge of the physics of speech production, an interactive source-filter model that takes the level-I effects of the nonlinear interaction of the source and the filter into account. In particular, a way of combining the glottal nonlinearity with the linear model of the vocal tract, to produce the glottal flow, is introduced. The proposed nonlinear interactive source-filter model is based on the quasi-steady glottal flow and linear vocal tract assumptions. It is an extension of the linear source-filter model with the capability of producing level-I interaction effects. We show that the proposed nonlinear model can be used in either interactive or non-interactive mode by means of a control parameter.

The parameters of the interactive system can be estimated by solving a combination of nonlinear blind estimation and parameter optimization problems. We propose a parameter estimation algorithm that yields a stable nonlinear interactive system model, which always performs better than the classical linear model.

The literature on source-filter interaction is dominated by simulation of the speech production system using directly measured physical quantities, such as vocal tract areas and the mechanical parameters of the vocal folds, after which the glottal flow waveforms produced by the proposed models are investigated. In this work, a nonlinear discrete-time system model is developed to solve the inverse problem, i.e., the estimation of the glottal flow from speech. The estimate is evaluated by comparing the estimated glottal flow to that obtained by linear inverse filtering.

The rest of this article begins with a review of source-filter interaction in voiced speech in Section 2. The dependence of the glottal flow on the vocal tract shape and the vocal fold vibration frequency (F0) is demonstrated by using simple vocal fold and vocal tract models. Section 3 describes the assumptions of the source-filter model. Section 4 introduces the proposed interactive source-filter model (ISFM); an extended interactive model that takes subglottal loading into consideration is also described in Section 4. In Section 5, a parameter estimation algorithm for the interactive model is proposed and, in Section 6, it is applied to a short speech segment. In Section 6, the proposed interactive and the classical non-interactive source-filter models are also compared on a large speech database using the Rosenberg+ source model. The Rosenberg+ glottal flow model is extended to an interactive source model that has the capability of producing the fine details of the glottal flow waveform. The interactive Rosenberg+ model is compared to the non-interactive Rosenberg+ model on the glottal flow estimates obtained from closed-phase analysis of a large speech database. Finally, Section 7 concludes the study.

Section snippets

A review of source-filter coupling

Nonlinear source-filter interaction has been studied by simulation, using acoustic and/or biomechanical models of the vocal system. Due to the nonlinearity of the glottal impedance, the interaction can be simulated in the time domain (Story, 1995). The one-mass and two-mass models of Ishizaka and Flanagan (1972) are among the first models that take the source-filter interaction into account. They solved the transmission line circuit analog (Fant, 1960, Flanagan, 1965, Ananthapadmanabha and Fant, 1982, Birkholz

Bernoulli theory for the glottal flow

The most widely used assumptions in modeling the glottal aerodynamics are that the glottal flow is incompressible (the flow density is constant) and steady during the opening of the vocal folds and unsteady during their closing. This is called the quasi-steady Bernoulli flow assumption. The pressure-flow relation can then be written with the well-known Bernoulli equation (Titze, 2006a):

P_TG = k_t \frac{\rho u_g^2}{2 a_g^2}

where u_g is the glottal flow, P_TG is the transglottal pressure (the difference between the subglottal and
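Inverting the Bernoulli relation for the flow gives u_g = a_g sqrt(2 P_TG / (k_t ρ)), which the following sketch evaluates in CGS units; the pressure and area values are illustrative, not taken from the paper's data.

```python
import math

def glottal_flow(p_tg, a_g, k_t=1.0, rho=1.14e-3):
    """Invert the quasi-steady Bernoulli relation P_TG = k_t * rho * u_g^2 / (2 * a_g^2).

    p_tg : transglottal pressure in dyn/cm^2
    a_g  : glottal area in cm^2
    rho  : air density in g/cm^3
    k_t  : transglottal pressure coefficient (~1 for an ideal orifice)
    Returns the glottal flow u_g in cm^3/s.
    """
    return a_g * math.sqrt(2.0 * p_tg / (k_t * rho))

# Illustrative phonation values: ~8 cm H2O (~7848 dyn/cm^2), 0.1 cm^2 glottal area.
u_g = glottal_flow(7848.0, 0.1)
```

With these values the flow comes out in the few-hundred cm^3/s range typical of modal phonation, and it scales linearly with the glottal area, as the closed-form inverse shows.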

Interactive source-filter modeling

We consider two ways to calculate p_IN from the chain matrix: from the lip velocity and from the glottal flow. Let us first write P_IN in terms of U_LIPS using K_TRACT, assuming that the radiation impedance is zero (P_LIP = 0),

\frac{P_{IN}(z)}{U_{LIPS}(z)} = B_{TRACT}(z) = z^{N/2} \sum_{k=0}^{N} b_k z^{-k}

Defining the transfer function from lip velocity to vocal tract input pressure as B(z) = \sum_{k=0}^{N} b_k z^{-k}, P_IN(z) is

P_{IN}(z) = z^{N/2} B(z) U_{LIPS}(z)

In the time domain,

p_{IN}[n] = \sum_{k=0}^{N} b_k u_{LIPS}[n + N/2 - k]

Note that in Eq. (13), p_IN[n] depends on the future values of u_LIPS[n] and it
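The non-causal sum in Eq. (13) can be evaluated as an ordinary convolution followed by an index shift, as the following sketch shows; the filter coefficients below are arbitrary illustrative values, not estimated b_k.

```python
import numpy as np

def vocal_tract_input_pressure(b, u_lips):
    """p_IN[n] = sum_{k=0}^{N} b[k] * u_lips[n + N/2 - k], as in Eq. (13).

    The filter is non-causal: p_IN[n] needs lip-velocity samples up to N/2
    ahead, so in a frame-based implementation the input must be buffered
    by N/2 samples before the sum can be evaluated.
    """
    b = np.asarray(b, dtype=float)
    n_order = len(b) - 1           # filter order N (assumed even here)
    full = np.convolve(u_lips, b)  # full[n] = sum_k b[k] * u_lips[n - k]
    shift = n_order // 2           # undo the z^{N/2} advance
    return full[shift : shift + len(u_lips)]

# Toy check: a unit impulse at n = 8 produces b itself, advanced by N/2 = 2.
b = [0.2, 0.5, 1.0, 0.5, 0.2]     # illustrative coefficients, N = 4
u = np.zeros(16)
u[8] = 1.0
p_in = vocal_tract_input_pressure(b, u)
```

The impulse response appears centered two samples early relative to a causal FIR filter, which is exactly the N/2-sample advance that makes p_IN[n] depend on future lip-velocity values.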

Parameter estimation

In speech analysis, the estimation of the model parameters is a blind estimation problem. We follow the steps below to find the model parameters A_max, P_L, a_k, b_k and q_k.

  1. Closed-phase analysis

     Estimation of the glottal flow waveform by removing the vocal tract and radiation filters from the speech signal using constrained closed-phase analysis (Alku et al., 2009). In this step, both the glottal flow waveform, u_g[n], and the vocal tract filter coefficients, a_k, are estimated from speech.

  2. Glottal parameterization

    For
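The inverse-filtering idea behind step 1 can be sketched as follows. This is a minimal illustration using ordinary autocorrelation LPC on a synthetic signal, not the constrained closed-phase covariance analysis of Alku et al. (2009) that the paper actually uses.

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method LPC via Levinson-Durbin; returns [1, a_1, ..., a_p]."""
    n = len(x)
    r = np.array([np.dot(x[: n - i], x[i:]) for i in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def inverse_filter(speech, a):
    """Apply A(z) to remove the estimated vocal tract: e[n] = sum_k a_k * s[n-k]."""
    return np.convolve(speech, a)[: len(speech)]

# Sanity check on a synthetic AR(1) "vocal tract" driven by white noise.
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
x = np.zeros_like(e)
x[0] = e[0]
for n in range(1, len(x)):
    x[n] = 0.95 * x[n - 1] + e[n]
a_est = lpc(x, 1)
residual = inverse_filter(x, a_est)
```

On this synthetic signal the estimated coefficient recovers the true pole (0.95) closely, and inverse filtering flattens the spectrum of the signal; on real speech the residual corresponds to the glottal excitation estimate.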

Experiments

In this section, we present two sets of experiments conducted on a short speech segment and on a large speech database recorded in an anechoic room at our laboratory. First, the model and the parameter estimation procedure are tested on a short speech segment in order to investigate how well the nonlinear model produces the fine details of the glottal flow obtained by inverse filtering. Then, the linear non-interactive and the nonlinear interactive models are compared using the Rosenberg+ model (
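The basic Rosenberg glottal pulse underlying the Rosenberg+ source model can be sketched as follows. This is only the classic three-segment Rosenberg shape; the Rosenberg+ variant used in the paper adds further parameters, which are not reproduced here.

```python
import numpy as np

def rosenberg_pulse(n_samples, open_quotient=0.6, speed_quotient=2.0):
    """One period of the basic Rosenberg glottal flow pulse.

    open_quotient : fraction of the period during which the glottis is open
    speed_quotient: ratio of the opening-phase to the closing-phase duration
    """
    n_open = int(open_quotient * n_samples)
    n_rise = int(n_open * speed_quotient / (1.0 + speed_quotient))
    n_fall = n_open - n_rise
    g = np.zeros(n_samples)
    t = np.arange(n_rise)
    g[:n_rise] = 0.5 * (1.0 - np.cos(np.pi * t / n_rise))  # opening phase
    t = np.arange(n_fall)
    g[n_rise:n_open] = np.cos(np.pi * t / (2.0 * n_fall))  # closing phase
    return g                                               # closed phase stays 0

g = rosenberg_pulse(200)
```

The pulse rises smoothly to its peak, falls more steeply over the shorter closing phase, and stays at zero during the closed phase; it is this smooth non-interactive shape that the interactive model augments with the ripple and skewing described in Section 2.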

Conclusions

In this work, we present a framework for modeling nonlinear source-filter coupling in voiced speech sounds. Based on this framework, two nonlinear interactive source-filter models, ISFM1 and ISFM2, are developed and their relations to the linear source-filter model (LSFM) and to each other are exposed. It is shown that the LSFM is an approximation of the ISFMs under certain conditions. The major difference between the ISFMs and the LSFM is that, in the LSFM, the source and the filter are assumed to be

References (49)

  • T. Ananthapadmanabha et al.

    Calculation of true glottal flow and its components

    Speech Commun.

    (1982)
  • I.R. Titze

    A theoretical study of F0-F1 interaction with application to resonant speaking and singing voice

    J. Voice

    (2004)
  • F. Alipour et al.

    A finite-element model of vocal-fold vibration

    J. Acoust. Soc. Am.

    (2000)
  • F. Alipour et al.

    Pulsatile airflow during phonation: an excised larynx model

    J. Acoust. Soc. Am.

    (1995)
  • F. Alipour et al.

    Characterizing glottal jet turbulence

    J. Acoust. Soc. Am.

    (2006)
  • P. Alku et al.

    Closed phase covariance analysis based on constrained linear prediction for glottal inverse filtering

    J. Acoust. Soc. Am.

    (2009)
  • I. Arroabarren et al.

    Inverse filtering in singing voice: a critical analysis

    IEEE Trans. Audio Speech Lang. Process.

    (2006)
  • B.S. Atal et al.

    Speech analysis and synthesis by linear prediction of the speech wave

    J. Acoust. Soc. Am.

    (1971)
  • P. Birkholz et al.

    Simulation of losses due to turbulence in the time-varying vocal system

    IEEE Trans. Audio Speech Lang. Process.

    (2007)
  • M. Bulut et al.

    On the robustness of overall F0-only modifications to the perception of emotions in speech

    J. Acoust. Soc. Am.

    (2008)
  • D. Childers et al.

    Measuring and modeling vocal source-tract interaction

    IEEE Trans. Biomed. Eng.

    (1994)
  • G. Fant

Acoustic Theory of Speech Production

    (1960)
  • J. Flanagan

    Speech Analysis Synthesis and Perception

    (1965)
  • Q. Fu et al.

    Robust glottal source estimation based on joint source-filter model optimization

    IEEE Trans. Audio Speech Lang. Process.

    (2006)
  • N. Henrich et al.

    On the use of the derivative of electroglottographic signals for characterization of nonpathological voice phonation

    J. Acoust. Soc. Am.

    (2004)
  • M.S. Howe et al.

    Source-tract interaction with prescribed vocal fold motion

    J. Acoust. Soc. Am.

    (2012)
  • K. Ishizaka et al.

    Synthesis of voiced source sounds from a two-mass model of the vocal cords

    Bell Syst. Tech. J.

    (1972)
  • B. Jong et al.

    Instantaneous orifice discharge coefficients of driven physical models of the human larynx

    J. Acoust. Soc. Am.

    (2007)
  • S. Jungsoo et al.

    Numerical simulation of turbulence transition and sound radiation for flow through a rigid glottal model

    J. Acoust. Soc. Am.

    (2007)
  • S. Jungsoo et al.

    Comparing turbulence models for flow through a rigid glottal model

    J. Acoust. Soc. Am.

    (2008)
  • J.L. Kelly et al.

    Speech synthesis

  • S. Khosla et al.

    Vortical flow field during phonation in an excised canine larynx model

    Ann. Otol. Rhinol. Laryngol.

    (2007)
  • S. Khosla et al.

    Using particle image velocimetry to measure anterior–posterior velocity gradients during phonation in the excised canine larynx model

    Ann. Otol. Rhinol. Laryngol.

    (2008)
  • A. Krishnamurthy et al.

    Two-channel speech analysis

    IEEE Trans. Acoust. Speech Signal Process.

    (1986)
Cited by (9)

    • Estimation of Source-Filter Interaction Regions Based on Electroglottography

      2019, Journal of Voice
      Citation excerpt:

      In level 1 interaction, acoustic vocal tract pressures affect the transglottal pressure, and with it the glottal airflow (Figure 1, inner loop). At this level, vocal fold vibration can remain relatively undisturbed.14 In the glottal airflow, however, frequencies creating harmonic distortion are produced that contribute to the source spectrum.

    • New Evidence That Nonlinear Source-Filter Coupling Affects Harmonic Intensity and fo Stability During Instances of Harmonics Crossing Formants

      2017, Journal of Voice
      Citation excerpt:

      In contrast to level 1 interactions, recall that level 2 interactions require the vibratory pattern of the vocal folds to be disturbed by the acoustic pressures within the vocal tract. These disruptions may take the form of pitch instabilities, biphonation, subharmonics, or deterministic chaos.9,16,17 It has already been shown that pitch jumps occur more frequently in phonations where fo crosses F19—implying that nonlinear source-filter coupling may impact the stability of fo when it is in the vicinity of F1.


    This paper has been recommended for acceptance by S. Narayanan.
