
Unit Selection Speech Synthesis Using Frame-Sized Speech Segments and Neural Network Based Acoustic Models

Published in: Journal of Signal Processing Systems

Abstract

This paper proposes selecting frame-sized speech segments for waveform-concatenation speech synthesis using neural network based acoustic models. First, a deep neural network (DNN) based frame selection method is presented. In this method, three DNNs are adopted to calculate target costs and concatenation costs for selecting candidate frames of 5 ms length. One DNN is built in the same way as in DNN-based statistical parametric speech synthesis; it predicts target acoustic features given linguistic context inputs. The distance between the acoustic features of a candidate unit and those predicted for a target unit is calculated as the target cost. The other two DNNs are constructed to predict the acoustic features of the current frame from its context features and the acoustic features of preceding frames. At synthesis time, these two DNNs are employed to calculate the concatenation cost for each candidate frame given its preceding frames. Furthermore, recurrent neural networks (RNNs) with long short-term memory (LSTM) cells are adopted in place of DNNs for acoustic modeling in order to make better use of sequential information. A strategy of using multi-frame units instead of single frames as the basic unit for selection is also presented to reduce the number of concatenation points within synthetic speech. Experimental results show that our proposed method achieves better naturalness than the hidden Markov model (HMM) based frame selection method and HMM-based parametric speech synthesis.
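As a rough illustration of the frame-level selection criterion described in the abstract, the sketch below shows how target and concatenation costs could be combined in a dynamic-programming search over 5 ms candidate frames. This is not the paper's implementation: the Euclidean distances, the weight lam, and the predict_next stand-in for the concatenation-prediction networks are illustrative assumptions.

# Conceptual sketch of frame-sized unit selection with neural-network-predicted
# acoustic targets. Not the paper's implementation; costs, weighting, and the
# predict_next placeholder for the concatenation networks are assumptions.
import numpy as np

def target_cost(candidate, predicted_target):
    # Distance between a candidate frame's acoustic features and the features
    # predicted by the target-prediction network (Euclidean distance assumed).
    return float(np.sum((candidate - predicted_target) ** 2))

def concat_cost(candidate, predicted_from_context):
    # Distance between a candidate frame and the features predicted from the
    # preceding frame by the concatenation networks (Euclidean distance assumed).
    return float(np.sum((candidate - predicted_from_context) ** 2))

def select_frames(predicted_targets, candidates, predict_next, lam=1.0):
    # Viterbi-style search over candidate frames.
    #   predicted_targets : (T, D) array, target-network output per target frame
    #   candidates        : list of length T; candidates[t] is an (N_t, D) array
    #   predict_next      : callable(prev_frame) -> (D,) array; stands in for the
    #                       networks conditioned on preceding frames
    #   lam               : weight balancing target and concatenation costs (assumed)
    T = len(candidates)
    costs = [np.array([target_cost(c, predicted_targets[0]) for c in candidates[0]])]
    back = []
    for t in range(1, T):
        prev_costs = costs[-1]
        cur = np.empty(len(candidates[t]))
        ptr = np.empty(len(candidates[t]), dtype=int)
        for j, cand in enumerate(candidates[t]):
            # The concatenation cost depends on which candidate preceded this one.
            join = np.array([concat_cost(cand, predict_next(p)) for p in candidates[t - 1]])
            total = prev_costs + lam * join
            ptr[j] = int(np.argmin(total))
            cur[j] = total[ptr[j]] + target_cost(cand, predicted_targets[t])
        costs.append(cur)
        back.append(ptr)
    # Trace back the lowest-cost path of candidate indices.
    path = [int(np.argmin(costs[-1]))]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return path[::-1]

In the paper's setting, predicted_targets would come from the target-prediction DNN (or LSTM-RNN), and the role of predict_next would be played by the two networks that predict the current frame's acoustic features from its context features and the preceding frames.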


Notes

  1. Some examples of synthetic speech can be found at http://home.ustc.edu.cn/~zzp1012/Springer/demo.html.

  2. https://www.mturk.com



Acknowledgements

This work was partly supported by the National Key Research and Development Program of China (Grant No. 2016YFB1001303), the CAS Strategic Priority Research Program (Grant No. XDB02070006), and the Fundamental Research Funds for the Central Universities (Grant No. WK2350000001).

Author information


Corresponding author

Correspondence to Zhen-Hua Ling.


Cite this article

Ling, ZH., Zhou, ZP. Unit Selection Speech Synthesis Using Frame-Sized Speech Segments and Neural Network Based Acoustic Models. J Sign Process Syst 90, 1053–1062 (2018). https://doi.org/10.1007/s11265-018-1336-0
