
Unit Selection Speech Synthesis Using Frame-Sized Speech Segments and Neural Network Based Acoustic Models

Published in: Journal of Signal Processing Systems

Abstract

This paper proposes selecting frame-sized speech segments for waveform-concatenation speech synthesis using neural network based acoustic models. First, a deep neural network (DNN) based frame selection method is presented. In this method, three DNNs are adopted to calculate target costs and concatenation costs for selecting candidate frames of 5 ms length. One DNN is built in the same way as in DNN-based statistical parametric speech synthesis; it predicts target acoustic features given linguistic context inputs. The distance between the acoustic features of a candidate unit and those predicted for a target unit is calculated as the target cost. The other two DNNs are constructed to predict the acoustic features of the current frame from its context features and the acoustic features of preceding frames. At synthesis time, these two DNNs are employed to calculate the concatenation cost for each candidate frame given its preceding frames. Furthermore, recurrent neural networks (RNNs) with long short-term memory (LSTM) cells are adopted in place of DNNs for acoustic modeling in order to make better use of sequential information. A strategy of using multi-frame units instead of single frames as the basic unit for selection is also presented to reduce the number of concatenation points within synthetic speech. Experimental results show that our proposed method achieves better naturalness than the hidden Markov model (HMM) based frame selection method and HMM-based parametric speech synthesis.
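As a rough illustration of the frame-level selection criterion described in the abstract, the sketch below shows how target and concatenation costs could be combined in a dynamic-programming search over 5 ms candidate frames. This is not the paper's implementation: the Euclidean distances, the weight lam, and the predict_next stand-in for the concatenation-prediction networks are illustrative assumptions.

# Conceptual sketch of frame-sized unit selection with neural-network-predicted
# acoustic targets. Not the paper's implementation; costs, weighting, and the
# predict_next placeholder for the concatenation networks are assumptions.
import numpy as np

def target_cost(candidate, predicted_target):
    # Distance between a candidate frame's acoustic features and the features
    # predicted by the target-prediction network (Euclidean distance assumed).
    return float(np.sum((candidate - predicted_target) ** 2))

def concat_cost(candidate, predicted_from_context):
    # Distance between a candidate frame and the features predicted from the
    # preceding frame by the concatenation networks (Euclidean distance assumed).
    return float(np.sum((candidate - predicted_from_context) ** 2))

def select_frames(predicted_targets, candidates, predict_next, lam=1.0):
    # Viterbi-style search over candidate frames.
    #   predicted_targets : (T, D) array, target-network output per target frame
    #   candidates        : list of length T; candidates[t] is an (N_t, D) array
    #   predict_next      : callable(prev_frame) -> (D,) array; stands in for the
    #                       networks conditioned on preceding frames
    #   lam               : weight balancing target and concatenation costs (assumed)
    T = len(candidates)
    costs = [np.array([target_cost(c, predicted_targets[0]) for c in candidates[0]])]
    back = []
    for t in range(1, T):
        prev_costs = costs[-1]
        cur = np.empty(len(candidates[t]))
        ptr = np.empty(len(candidates[t]), dtype=int)
        for j, cand in enumerate(candidates[t]):
            # The concatenation cost depends on which candidate preceded this one.
            join = np.array([concat_cost(cand, predict_next(p)) for p in candidates[t - 1]])
            total = prev_costs + lam * join
            ptr[j] = int(np.argmin(total))
            cur[j] = total[ptr[j]] + target_cost(cand, predicted_targets[t])
        costs.append(cur)
        back.append(ptr)
    # Trace back the lowest-cost path of candidate indices.
    path = [int(np.argmin(costs[-1]))]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return path[::-1]

In the paper's setting, predicted_targets would come from the target-prediction DNN (or LSTM-RNN), and the role of predict_next would be played by the two networks that predict the current frame's acoustic features from its context features and the preceding frames.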


Notes

  1. Some examples of synthetic speech can be found at http://home.ustc.edu.cn/~zzp1012/Springer/demo.html.

  2. https://www.mturk.com



Acknowledgements

This work was partly supported by the National Key Research and Development Program of China (Grant No. 2016YFB1001303), the CAS Strategic Priority Research Program (Grant No. XDB02070006), and the Fundamental Research Funds for the Central Universities (Grant No. WK2350000001).

Author information


Corresponding author

Correspondence to Zhen-Hua Ling.


Cite this article

Ling, ZH., Zhou, ZP. Unit Selection Speech Synthesis Using Frame-Sized Speech Segments and Neural Network Based Acoustic Models. J Sign Process Syst 90, 1053–1062 (2018). https://doi.org/10.1007/s11265-018-1336-0
