Abstract
Speech synthesis, or text-to-speech (TTS), the ability of a system to produce natural, human-like speech from written text, is gaining popularity in the field of speech processing. For any TTS system, intelligibility and naturalness are the two key measures of the quality of the synthesized speech, and both depend heavily on how prosody is captured by the synthesizer's acoustic model. The purpose of this survey is, first, to study and analyze the approaches to acoustic modeling used traditionally (articulatory synthesis, formant synthesis, concatenative speech synthesis, and statistical parametric synthesis based on hidden Markov models) and more recently (statistical parametric synthesis based on deep learning), along with their pros and cons. Deep learning approaches to acoustic modeling have contributed significantly to the advancement of TTS, since deep models are able to capture the complex context dependencies in the input data. Beyond acoustic modeling, this article also reviews TTS approaches for generating speech with different voices and emotions, which make TTS more realistic to use, and it addresses the subjective and objective metrics used to measure the quality of the synthesized voice. Well-known speech synthesis systems built on autoregressive and non-autoregressive models, such as Tacotron, Deep Voice, WaveNet, Parallel WaveNet, Parallel Tacotron, and FastSpeech from global tech giants including Google, Baidu, Facebook, and Microsoft, employ deep learning architectures for end-to-end speech waveform generation and have attained remarkable mean opinion scores (MOS).
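For context on the MOS figures cited throughout this review: MOS is the arithmetic mean of listener ratings on a five-point absolute category rating scale (1 = bad, 5 = excellent), customarily reported with a 95% confidence interval. The following is a minimal illustrative sketch, not part of the original article; the function name and the listener ratings are hypothetical.

```python
import statistics

def mean_opinion_score(ratings):
    """Aggregate listener ratings (1 = bad ... 5 = excellent) into a MOS.

    Returns the mean rating and a 95% confidence half-width, so that two
    synthesizers can be compared as, e.g., "MOS 4.30 +/- 0.35".
    """
    n = len(ratings)
    mos = statistics.fmean(ratings)
    # Normal approximation for the 95% confidence interval of the mean.
    ci95 = 1.96 * statistics.stdev(ratings) / n ** 0.5 if n > 1 else 0.0
    return mos, ci95

# Hypothetical ratings collected for one set of synthesized utterances.
ratings = [5, 4, 4, 5, 3, 4, 5, 4, 4, 5]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```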
Data availability
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Acknowledgements
This research work is supported by IK Gujral Punjab Technical University, Kapurthala, Punjab, India.
Funding
No funding was received to assist with the preparation of this manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kaur, N., Singh, P. Conventional and contemporary approaches used in text to speech synthesis: a review. Artif Intell Rev 56, 5837–5880 (2023). https://doi.org/10.1007/s10462-022-10315-0
DOI: https://doi.org/10.1007/s10462-022-10315-0
Keywords
- Concatenative speech synthesis
- Formant speech synthesis
- Articulatory speech synthesis
- Statistical parametric speech synthesis using hidden Markov models and deep learning methods
- Expressive TTS
- Multi-lingual and multi-speaker TTS
- Autoregressive and non-autoregressive models
- Speech quality metric
- Speech corpus