Simultaneous Estimation of Glottal Source Waveforms and Vocal Tract Shapes from Speech Signals Based on ARX-LF Model

Journal of Signal Processing Systems

Abstract

Glottal source waveforms and vocal tract shapes are typically estimated by inverse filtering the speech signal and then fitting a glottal source model to the residual signal. However, source-tract interactions reduce the accuracy of this two-step estimation. In this paper, we propose a method that estimates glottal source waveforms and vocal tract shapes simultaneously, using an analysis-by-synthesis approach with a source-filter model that combines an Auto-Regressive eXogenous (ARX) model with the Liljencrants-Fant (LF) model. Because optimizing many parameters at once makes simultaneous estimation difficult, we first initialize the glottal source parameters with the inverse-filter method and then jointly refine the parameters of the glottal source and the vocal tract by analysis-by-synthesis. Experiments with synthetic and real speech signals showed that the proposed method achieves higher estimation accuracy than the inverse-filter method.
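The joint estimation idea can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the ARX model is written as s(n) = -Σ a_i s(n-i) + b0 u(n) + e(n), the source u(n) is a simple polynomial pulse controlled by one hypothetical parameter (`open_quotient`) standing in for the full LF model, and a coarse grid search replaces the inverse-filter initialization and optimizer described in the abstract. All function names are illustrative.

```python
import numpy as np

def toy_source(n_samples, period, open_quotient):
    """Periodic stand-in for the glottal flow derivative (NOT the LF model)."""
    n_open = max(2, int(round(open_quotient * period)))
    t = np.arange(n_open) / n_open
    one_period = np.zeros(period)
    # Derivative of the flow pulse t^2 * (1 - t) over the open phase; zero when closed.
    one_period[:n_open] = 2.0 * t - 3.0 * t ** 2
    reps = n_samples // period + 1
    return np.tile(one_period, reps)[:n_samples]

def synthesize(a, b0, u):
    """Run the assumed ARX recursion without noise: s(n) = -sum a_i s(n-i) + b0 u(n)."""
    s = np.zeros(len(u))
    for n in range(len(u)):
        past = sum(a[i] * s[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        s[n] = -past + b0 * u[n]
    return s

def arx_fit(s, u, order):
    """Least-squares fit of the AR coefficients and input gain for a fixed source u."""
    n = np.arange(order, len(s))
    columns = [-s[n - i] for i in range(1, order + 1)] + [u[n]]
    X = np.column_stack(columns)
    theta, *_ = np.linalg.lstsq(X, s[n], rcond=None)
    err = np.mean((s[n] - X @ theta) ** 2)  # mean squared equation error
    return theta[:order], theta[order], err

def estimate(s, period, order, oq_grid):
    """Analysis-by-synthesis: keep the source parameter with the smallest ARX error."""
    best = None
    for oq in oq_grid:
        u = toy_source(len(s), period, oq)
        a, b0, err = arx_fit(s, u, order)
        if best is None or err < best[3]:
            best = (oq, a, b0, err)
    return best

# Usage: synthesize a signal with known parameters, then try to recover them.
true_a = np.array([-2 * 0.9 * np.cos(0.3 * np.pi), 0.81])  # one stable resonance
speech = synthesize(true_a, 1.0, toy_source(800, 80, 0.6))
oq, a, b0, err = estimate(speech, period=80, order=2,
                          oq_grid=np.linspace(0.3, 0.9, 13))
print(oq, a, b0, err)
```

For each candidate source, the vocal tract coefficients follow from a linear least-squares problem, so only the source parameters need a nonlinear search. With the actual LF model the search space has several dimensions, which is why a good starting point, such as the inverse-filter initialization mentioned above, matters.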


Acknowledgements

This study was supported by a Grant-in-Aid for Scientific Research (A) (No. 25240026), the JST-Mirai Program (JP-MJMI18D1), and the China Scholarship Council (CSC).

Author information

Correspondence to Yongwei Li.

About this article

Cite this article

Li, Y., Sakakibara, KI. & Akagi, M. Simultaneous Estimation of Glottal Source Waveforms and Vocal Tract Shapes from Speech Signals Based on ARX-LF Model. J Sign Process Syst 92, 831–838 (2020). https://doi.org/10.1007/s11265-019-01510-4

  • DOI: https://doi.org/10.1007/s11265-019-01510-4