Abstract
This paper evaluates the effectiveness of video data for voice source parametrization in physical models of voice production. Laryngeal imaging techniques can be used to obtain video sequences of the vocal folds and to derive the time patterns of relevant glottal cues, such as fold edge position or glottal area. In many physically based numerical models of the vocal folds, the model parameters are estimated from the inverse-filtered glottal flow waveform, obtained from audio recordings of the sound pressure at the lips. However, this model inversion is often problematic and suffers from accuracy and robustness issues. We discuss how video analysis of the fold vibration might be coupled to parametric estimation algorithms based on voice recordings to improve the accuracy and robustness of model inversion.
Acknowledgments
We wish to thank Cymo B.V., Groningen, The Netherlands, for kindly providing the acoustic and videokymographic data used in this paper. We also wish to thank the two anonymous reviewers for their valuable comments and suggestions.
About this article
Cite this article
Drioli, C., Foresti, G.L. Accurate glottal model parametrization by integrating audio and high-speed endoscopic video data. SIViP 9, 1451–1459 (2015). https://doi.org/10.1007/s11760-013-0597-0
DOI: https://doi.org/10.1007/s11760-013-0597-0