Abstract
Glottal source waveforms and vocal tract shapes are typically estimated by passing the speech signal through an inverse filter and then fitting a glottal source model to the residual signal. However, source-tract interactions reduce the accuracy of this sequential estimation. In this paper, we propose a method that estimates glottal source waveforms and vocal tract shapes simultaneously, using an analysis-by-synthesis approach with a source-filter model that combines an Auto-Regressive eXogenous (ARX) model with the Liljencrants-Fant (LF) model. Because jointly optimizing many parameters makes simultaneous estimation difficult, we first initialize the glottal source parameters using the inverse filter method and then refine the glottal source and vocal tract parameters simultaneously by analysis-by-synthesis. Experiments with synthetic and real speech signals showed that the proposed method achieves higher estimation accuracy than the inverse filter method.
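The joint estimation idea can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes the ARX form s(n) = -Σ a_k s(n-k) + b0 u(n) + e(n), replaces the LF model with a hypothetical one-parameter raised-cosine pulse (`toy_source`), and replaces the paper's initialization and optimization with a plain grid search. All function and parameter names here are illustrative assumptions.

```python
import numpy as np

def toy_source(n_samples, period, open_quotient):
    """Hypothetical stand-in for the LF model: derivative of a
    raised-cosine glottal flow pulse repeated every `period` samples."""
    t = np.arange(n_samples) % period
    n_open = max(2, int(round(open_quotient * period)))
    flow = np.where(t < n_open, 0.5 * (1.0 - np.cos(2.0 * np.pi * t / n_open)), 0.0)
    return np.diff(flow, prepend=0.0)

def synthesize(source, a, b0=1.0):
    """Run the source through the all-pole filter of the assumed ARX model."""
    s = np.zeros(len(source))
    for n in range(len(source)):
        acc = b0 * source[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * s[n - k]
        s[n] = acc
    return s

def fit_arx(speech, source, order):
    """Least-squares estimate of the AR coefficients and input gain
    for a fixed candidate source; returns (a, b0, mean squared error)."""
    n = np.arange(order, len(speech))
    past = np.column_stack([-speech[n - k] for k in range(1, order + 1)])
    X = np.column_stack([past, source[n]])
    theta, *_ = np.linalg.lstsq(X, speech[n], rcond=None)
    resid = speech[n] - X @ theta
    return theta[:order], theta[order], float(np.mean(resid ** 2))

def analysis_by_synthesis(speech, period, order, candidates):
    """Keep the source parameter whose ARX fit leaves the smallest error."""
    best = None
    for oq in candidates:
        a, b0, err = fit_arx(speech, toy_source(len(speech), period, oq), order)
        if best is None or err < best[3]:
            best = (oq, a, b0, err)
    return best

# Usage: recover known parameters from noise-free synthetic speech.
a_true = [-1.2, 0.8]  # stable second-order filter, chosen for illustration
speech = synthesize(toy_source(400, 80, 0.6), a_true)
oq, a, b0, err = analysis_by_synthesis(speech, 80, 2, [0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
print(oq, a, b0, err)
```

For each candidate source, the vocal tract coefficients have a closed-form least-squares solution, so only the source parameters need to be searched. A real system would search the LF parameters from an inverse-filter initialization rather than scanning a fixed grid.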
Acknowledgements
This study was supported by a Grant-in-Aid for Scientific Research (A) (No. 25240026), JST-Mirai Program (JP-MJMI18D1) and China Scholarship Council (CSC).
Cite this article
Li, Y., Sakakibara, KI. & Akagi, M. Simultaneous Estimation of Glottal Source Waveforms and Vocal Tract Shapes from Speech Signals Based on ARX-LF Model. J Sign Process Syst 92, 831–838 (2020). https://doi.org/10.1007/s11265-019-01510-4
DOI: https://doi.org/10.1007/s11265-019-01510-4