Algorithms for speech bandwidth extension (BWE) may work in either the time domain or the frequency domain. Time-domain methods often fail to sufficiently recover the high-frequency content of speech signals; frequency-domain methods are better at recovering the spectral envelope, but have difficulty reconstructing the details of the waveform. In this paper, we propose a two-stage approach to BWE, which enjoys the advantages of both time- and frequency-domain methods. The first stage is a frequency-domain neural network, which predicts the high-frequency part of the wide-band spectrogram from the narrow-band input spectrogram. The wide-band spectrogram is then converted into a time-domain waveform and passed through the second stage to refine the temporal details. For the first stage, we compare a convolutional recurrent network (CRN) with a temporal convolutional network (TCN), and find that the latter captures long-span dependencies as well as the former while using far fewer parameters. For the second stage, we enhance the Wave-U-Net architecture with a multi-resolution short-time Fourier transform (MSTFT) loss function. A series of comprehensive experiments shows that the proposed system achieves superior performance in speech enhancement (measured by both time- and frequency-domain metrics) as well as speech recognition.
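To make the MSTFT loss mentioned above concrete, here is a minimal numpy sketch of one common formulation: a spectral-convergence term plus a log-magnitude L1 term, averaged over several (FFT size, hop size) resolutions. The specific resolutions and term weighting are illustrative assumptions; the paper's exact configuration may differ.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT via Hann-windowed framing and a real FFT."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def mstft_loss(y, y_hat, resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Multi-resolution STFT loss: for each (n_fft, hop) setting, sum a
    spectral-convergence term and a log-magnitude L1 term, then average
    over resolutions (hypothetical settings, not the paper's)."""
    total = 0.0
    for n_fft, hop in resolutions:
        S, S_hat = stft_mag(y, n_fft, hop), stft_mag(y_hat, n_fft, hop)
        # Spectral convergence: Frobenius-norm relative error.
        sc = np.linalg.norm(S - S_hat) / (np.linalg.norm(S) + 1e-8)
        # L1 distance between log magnitudes.
        mag = np.mean(np.abs(np.log(S + 1e-8) - np.log(S_hat + 1e-8)))
        total += sc + mag
    return total / len(resolutions)
```

Comparing spectra at multiple window lengths penalizes errors at several time-frequency trade-offs at once, which is why such losses are popular for waveform-refinement networks like the second stage described here.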
Cite as: Lin, J., Wang, Y., Kalgaonkar, K., Keren, G., Zhang, D., Fuegen, C. (2021) A Two-Stage Approach to Speech Bandwidth Extension. Proc. Interspeech 2021, 1689-1693, doi: 10.21437/Interspeech.2021-1941
@inproceedings{lin21d_interspeech,
  author    = {Ju Lin and Yun Wang and Kaustubh Kalgaonkar and Gil Keren and Didi Zhang and Christian Fuegen},
  title     = {{A Two-Stage Approach to Speech Bandwidth Extension}},
  year      = {2021},
  booktitle = {Proc. Interspeech 2021},
  pages     = {1689--1693},
  doi       = {10.21437/Interspeech.2021-1941}
}