DOI: https://doi.org/10.1145/3664647.3681097

ArtSpeech: Adaptive Text-to-Speech Synthesis with Articulatory Representations

Published: 28 October 2024

Abstract

We devise ArtSpeech, an articulatory-representation-based text-to-speech (TTS) model: an explainable and effective network for human-like speech synthesis, designed by revisiting the human sound production system. Current deep TTS models learn the text-to-acoustic mapping in a fully parametric manner, ignoring the explicit physical significance of articulatory movement. In contrast, ArtSpeech leverages articulatory representations to perform adaptive TTS, explicitly characterizing the voice tone and speaking prosody of different speakers. Specifically, energy, F0, and vocal tract variables are used to represent, respectively, the airflow forced through the articulatory organs, the degree of tension in the vocal folds of the larynx, and the coordinated movements among the articulators. We also design a multi-dimensional style mapping network that extracts speaking styles from these articulatory representations; the extracted styles guide variation predictors that produce the final mel-spectrogram output. To validate the effectiveness of our approach, we conducted comprehensive experiments and analyses on the widely used LJSpeech and LibriTTS corpora, which show promising improvements in similarity between the generated speech and the target speaker's voice and prosody.
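
To make the abstract's feature decomposition concrete, the sketch below shows how the three named signal-level quantities could be extracted from a waveform. This is a minimal illustration using librosa under assumed frame settings, not the authors' pipeline; extract_features and all constants are hypothetical, and the vocal tract variables are only stubbed out, since in practice they come from a separate acoustic-to-articulatory inversion model.

import librosa
import numpy as np

HOP = 256      # assumed hop size: one frame every ~11.6 ms at 22.05 kHz
N_FFT = 1024   # assumed analysis window length
N_MELS = 80    # a common mel-spectrogram size for TTS targets

def extract_features(wav_path):
    """Hypothetical extractor for energy, F0, and the mel target."""
    y, sr = librosa.load(wav_path, sr=22050)

    # Energy: per-frame RMS amplitude, a rough proxy for the airflow
    # forced through the articulatory organs.
    energy = librosa.feature.rms(y=y, frame_length=N_FFT, hop_length=HOP)[0]

    # F0: per-frame fundamental frequency via probabilistic YIN,
    # reflecting vocal-fold tension; unvoiced frames come back as NaN.
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"),
        sr=sr,
        frame_length=N_FFT,
        hop_length=HOP,
    )
    f0 = np.nan_to_num(f0)  # zero out unvoiced frames

    # Vocal tract variables (TVs): placeholder only. A real system would
    # obtain these from an acoustic-to-articulatory inversion model.
    tvs = None

    # Log-mel spectrogram: the output the variation predictors regress to.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)
    log_mel = librosa.power_to_db(mel)

    return energy, f0, tvs, log_mel

Because energy, F0, and the mel target share the same hop size in this sketch, they align frame by frame, which is what lets per-frame style and variation predictors condition on them jointly.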



      Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. articulatory representation
      2. style transfer
      3. tts

      Qualifiers

      • Research-article

      Funding Sources

      • Young Scientists Fund of the National Natural Science Foundation of China

      Conference

      MM '24: The 32nd ACM International Conference on Multimedia
      October 28 - November 1, 2024
      Melbourne VIC, Australia

      Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
