ABSTRACT
In this paper, a new speech-driven 3-D geometric tongue model is constructed. The 3-D tongue shape is controlled by control points on the 2-D midsagittal tongue curve, and speech-driven inverse estimation based on the model is evaluated on empirical data. 2-D X-ray videos of vocal tract motion are annotated for midsagittal tongue motion, and static 3-D vocal tracts of 20 phonemes are acquired with MRI to provide realistic 3-D tongue shapes. Mel-frequency cepstral coefficients (MFCCs) are computed from the audio of the videos as acoustic features and are fed to an LSTM-RNN that predicts the movement of the control points. Three geometrically intuitive control points are selected, and the midsagittal line of the tongue is calculated from them through linear regression. Cross-sections along the central line of the tongue, whose height, width and angle are predicted from the midsagittal line, are reconstructed with geometric curves, and each cross-section is then placed on the midsagittal line to obtain the predicted moving 3-D tongue grid. In this model, acoustic features are mapped directly to realistic tongue motion, which preserves more realistic articulatory detail; the control points are intuitive enough for non-experts to operate the model; and the predicted geometric tongue shapes are comparable with realistic tongue dynamics. The speech-driven prediction is evaluated against the realistic data, and the results indicate that the proposed method is feasible.
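The linear-regression step that recovers the midsagittal line from three control points can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes each frame's midsagittal contour is resampled to a fixed number of (x, y) points and that an ordinary least-squares map with a bias term is used. The abstract does not state the sampling density or the regression details, and the names `fit_contour_regression` and `predict_contour` are hypothetical.

```python
import numpy as np


def fit_contour_regression(control_pts, contours):
    """Hypothetical least-squares map from 3 control points to a midsagittal contour.

    Assumes every frame's contour was resampled to the same number of points.
    control_pts: (n_frames, 3, 2) array of (x, y) control points.
    contours:    (n_frames, n_samples, 2) array of sampled contour points.
    Returns W of shape (7, 2 * n_samples): 6 control coordinates plus a bias row.
    """
    n = control_pts.shape[0]
    # Flatten the three (x, y) points into 6 features and append a constant 1.
    X = np.hstack([control_pts.reshape(n, -1), np.ones((n, 1))])
    # Flatten every contour into one output vector (x0, y0, x1, y1, ...).
    Y = contours.reshape(n, -1)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W


def predict_contour(W, control_pts):
    """Apply a fitted map to new frames; returns (n_frames, n_samples, 2)."""
    n = control_pts.shape[0]
    X = np.hstack([control_pts.reshape(n, -1), np.ones((n, 1))])
    return (X @ W).reshape(n, -1, 2)
```

Stacking both coordinates of every contour sample into one output vector lets a single `lstsq` call fit all sample positions at once.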
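Placing cross-sections on the midsagittal line to form a 3-D grid can be sketched as follows. The abstract does not say which geometric curve is used or how the angle is defined, so this sketch assumes a half-ellipse whose apex sits on the midsagittal point, with `width` as the full lateral extent, `height` as the depth below the apex, and `angle` as the direction of the section's in-plane axis within the midsagittal plane. The names `cross_section` and `tongue_grid` are hypothetical.

```python
import numpy as np


def cross_section(apex, height, width, angle, n_pts=21):
    """Hypothetical half-ellipse cross-section hung from one midsagittal point.

    Assumed conventions (not from the source):
    apex:   (x, y) point on the midsagittal tongue line (lateral z = 0).
    height: depth of the arc below the apex along the in-plane axis u.
    width:  full lateral extent; z runs from +width/2 to -width/2.
    angle:  direction of u (radians from +x) inside the midsagittal plane,
            with u pointing away from the tongue body.
    Returns an (n_pts, 3) array of (x, y, z) points.
    """
    t = np.linspace(0.0, np.pi, n_pts)
    u = np.array([np.cos(angle), np.sin(angle), 0.0])  # in-plane axis
    z_hat = np.array([0.0, 0.0, 1.0])                  # lateral axis
    p = np.array([apex[0], apex[1], 0.0])
    depth = height * (1.0 - np.sin(t))   # 0 at the apex, height at both edges
    lateral = 0.5 * width * np.cos(t)    # +width/2 ... -width/2
    return p - depth[:, None] * u + lateral[:, None] * z_hat


def tongue_grid(midline, heights, widths, angles, n_pts=21):
    """Stack one cross-section per midsagittal sample into an (m, n_pts, 3) grid."""
    return np.stack([
        cross_section(pt, h, w, a, n_pts)
        for pt, h, w, a in zip(midline, heights, widths, angles)
    ])
```

The middle sample of each section (t = pi/2) coincides with the midsagittal point, so the grid's central column reproduces the 2-D midsagittal line exactly.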
Index Terms
- A Speech-Driven 3-D Tongue Model with Realistic Movement in Mandarin Chinese