ISCA Archive Interspeech 2018

On the Application and Compression of Deep Time Delay Neural Network for Embedded Statistical Parametric Speech Synthesis

Yibin Zheng, Jianhua Tao, Zhengqi Wen, Ruibo Fu

Acoustic models based on long short-term memory (LSTM) recurrent neural networks (RNNs) have been applied to statistical parametric speech synthesis (SPSS) and have shown significant improvements. However, the model complexity and inference cost of RNNs are much higher than those of feed-forward neural networks (FNNs) due to the sequential nature of their computation, which limits their usage in many runtime applications. In this paper, we explore a novel application of the deep time delay neural network (TDNN) to embedded SPSS, which requires a low disk footprint, low memory usage and low latency. The TDNN can model long short-term temporal dependencies at an inference cost comparable to a standard FNN, and the temporal subsampling it enables further reduces computational complexity. We then compress the deep TDNN using singular value decomposition (SVD) to further reduce model complexity, motivated by the goal of building embedded SPSS systems that run efficiently on mobile devices. Both objective and subjective experimental results show that the proposed deep TDNN with SVD compression generates synthesized speech with better quality than the FNN and quality comparable to the LSTM, while drastically reducing model complexity and speech parameter generation time.


doi: 10.21437/Interspeech.2018-1970

Cite as: Zheng, Y., Tao, J., Wen, Z., Fu, R. (2018) On the Application and Compression of Deep Time Delay Neural Network for Embedded Statistical Parametric Speech Synthesis. Proc. Interspeech 2018, 922-926, doi: 10.21437/Interspeech.2018-1970

@inproceedings{zheng18b_interspeech,
  author={Yibin Zheng and Jianhua Tao and Zhengqi Wen and Ruibo Fu},
  title={{On the Application and Compression of Deep Time Delay Neural Network for Embedded Statistical Parametric Speech Synthesis}},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={922--926},
  doi={10.21437/Interspeech.2018-1970}
}