DOI: https://doi.org/10.1145/3664647.3681097

ArtSpeech: Adaptive Text-to-Speech Synthesis with Articulatory Representations

Published: 28 October 2024

Abstract

We devise ArtSpeech, an articulatory-representation-based text-to-speech (TTS) model: an explainable and effective network for human-like speech synthesis, designed by revisiting the human sound production system. Current deep TTS models learn the text-to-acoustic mapping in a fully parametric manner, ignoring the explicit physical significance of articulatory movement. In contrast, ArtSpeech leverages articulatory representations to perform adaptive TTS, explicitly characterizing the voice tone and speaking prosody of different speakers. Specifically, energy, F0, and vocal tract variables are used to represent, respectively, the airflow forced through the articulatory organs, the degree of tension in the vocal folds of the larynx, and the coordinated movements among the articulators. We also design a multi-dimensional style mapping network that extracts speaking styles from these articulatory representations; the extracted styles guide variation predictors that produce the final mel-spectrogram output. To validate the effectiveness of our approach, we conducted comprehensive experiments and analyses on the widely used LJSpeech and LibriTTS corpora, which show promising improvements in similarity between the generated speech and the target speaker's voice and prosody.
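
To make the abstract's feature decomposition concrete, the sketch below shows how the three named signal-level quantities could be extracted from a waveform. This is a minimal illustration using librosa under assumed frame settings, not the authors' pipeline; extract_features and all constants are hypothetical, and the vocal tract variables are only stubbed out, since in practice they come from a separate acoustic-to-articulatory inversion model.

import librosa
import numpy as np

HOP = 256      # assumed hop size: one frame every ~11.6 ms at 22.05 kHz
N_FFT = 1024   # assumed analysis window length
N_MELS = 80    # a common mel-spectrogram size for TTS targets

def extract_features(wav_path):
    """Hypothetical extractor for energy, F0, and the mel target."""
    y, sr = librosa.load(wav_path, sr=22050)

    # Energy: per-frame RMS amplitude, a rough proxy for the airflow
    # forced through the articulatory organs.
    energy = librosa.feature.rms(y=y, frame_length=N_FFT, hop_length=HOP)[0]

    # F0: per-frame fundamental frequency via probabilistic YIN,
    # reflecting vocal-fold tension; unvoiced frames come back as NaN.
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"),
        sr=sr,
        frame_length=N_FFT,
        hop_length=HOP,
    )
    f0 = np.nan_to_num(f0)  # zero out unvoiced frames

    # Vocal tract variables (TVs): placeholder only. A real system would
    # obtain these from an acoustic-to-articulatory inversion model.
    tvs = None

    # Log-mel spectrogram: the output the variation predictors regress to.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)
    log_mel = librosa.power_to_db(mel)

    return energy, f0, tvs, log_mel

Because energy, F0, and the mel target share the same hop size in this sketch, they align frame by frame, which is what lets per-frame style and variation predictors condition on them jointly.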



      Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. articulatory representation
      2. style transfer
      3. tts

      Qualifiers

      • Research-article

      Funding Sources

      • Young Scientists Fund of the National Natural Science Foundation of China

      Conference

      MM '24: The 32nd ACM International Conference on Multimedia
      October 28 - November 1, 2024
      Melbourne VIC, Australia

      Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
