ISCA Archive Interspeech 2022

A compact transformer-based GAN vocoder

Chenfeng Miao, Ting Chen, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao

Recent work has shown that the self-attention module in the Transformer architecture is an effective way of modeling natural language and images. In this work, we propose a novel way to perform audio synthesis with a Self-Attention Network (SAN). To the best of our knowledge, there has been no successful application of the Transformer architecture or SAN to high-fidelity waveform generation. The main challenge in adapting SAN to audio generation lies in the quadratic growth of its computational complexity with respect to the input sequence length, which makes it impractical for high-resolution audio tasks. To tackle this problem, we apply dilated sliding-window attention to the vanilla SAN. This technique gives our model a large receptive field, linear computational complexity, and an extremely small footprint. We show experimentally that the proposed model achieves a smaller model size while producing audio samples of comparable speech quality to the best publicly available model. In particular, our small-footprint model has only 0.57M parameters and can generate 22.05 kHz high-fidelity audio 113 times faster than real time on an NVIDIA V100 GPU without engineered inference kernels.
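The key idea of restricting each query to a dilated sliding window of keys can be sketched in a few lines. The snippet below is a minimal, single-head illustration in PyTorch, not the authors' implementation; the function name, `window`, and `dilation` are illustrative assumptions. Because every position attends to a fixed number of neighbors, the cost grows linearly with sequence length rather than quadratically.

```python
# Minimal sketch of dilated sliding-window self-attention (single head).
# Hypothetical names; not the paper's actual code.
import torch
import torch.nn.functional as F

def dilated_window_attention(q, k, v, window=4, dilation=2):
    """q, k, v: (batch, length, dim). Each query attends only to keys at
    relative offsets {-window*dilation, ..., 0, ..., window*dilation},
    so complexity is linear in the sequence length."""
    b, n, d = q.shape
    offsets = torch.arange(-window, window + 1) * dilation           # (2w+1,)
    idx = torch.arange(n).unsqueeze(1) + offsets.unsqueeze(0)        # (n, 2w+1) absolute key indices
    valid = (idx >= 0) & (idx < n)                                   # mask positions outside the sequence
    idx = idx.clamp(0, n - 1)

    k_win = k[:, idx]                                                # (b, n, 2w+1, d) gathered keys
    v_win = v[:, idx]                                                # (b, n, 2w+1, d) gathered values
    scores = torch.einsum('bnd,bnwd->bnw', q, k_win) / d ** 0.5      # windowed dot products
    scores = scores.masked_fill(~valid, float('-inf'))
    attn = F.softmax(scores, dim=-1)
    return torch.einsum('bnw,bnwd->bnd', attn, v_win)

# Usage example: a toy feature sequence; output keeps the input shape.
x = torch.randn(1, 256, 64)
out = dilated_window_attention(x, x, x)
print(out.shape)  # torch.Size([1, 256, 64])
```

Stacking such layers with increasing dilation enlarges the effective receptive field without giving up the linear per-layer cost, which is what allows the footprint and speed figures reported in the abstract.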


doi: 10.21437/Interspeech.2022-11254

Cite as: Miao, C., Chen, T., Chen, M., Ma, J., Wang, S., Xiao, J. (2022) A compact transformer-based GAN vocoder. Proc. Interspeech 2022, 1636-1640, doi: 10.21437/Interspeech.2022-11254

@inproceedings{miao22b_interspeech,
  author={Chenfeng Miao and Ting Chen and Minchuan Chen and Jun Ma and Shaojun Wang and Jing Xiao},
  title={{A compact transformer-based GAN vocoder}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={1636--1640},
  doi={10.21437/Interspeech.2022-11254}
}