Abstract
Speech coding aims to represent speech signals efficiently in digital form. Existing solutions usually suffer from large quantization error when coding at very low bit rates, which causes serious distortion of the spectral energy of the reconstructed speech; moreover, most current approaches ignore the fact that this distortion is often unevenly distributed across the spectrum. To address these issues, we propose a novel multi-band multi-scale generative adversarial network (MBMS-GAN) for speech coding. In particular, the speech codec is trained in an adversarial manner at different sub-bands and scales so that both global and local spectral energy distortion are taken into account. In addition, a unified codebook matching strategy is designed that integrates Euclidean distance and cosine similarity, thereby considering both the absolute distance between two vectors and their directions. We verify the effectiveness of our method on the popular CSTR-VCTK dataset, and the results demonstrate that it significantly improves the quality of speech reconstructed at 600 bps, by 0.19 in terms of MOS score. Our study has high application value in scenarios with narrow communication channels, such as satellite communication.
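The unified codebook matching strategy described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the weighting factor `alpha`, the distance normalization, and the function name `match_codebook` are assumptions introduced here for clarity.

```python
import numpy as np

def match_codebook(z, codebook, alpha=0.5):
    """Pick the index of the codeword closest to z under a combined metric.

    Integrates Euclidean distance (absolute difference between vectors)
    with cosine distance (difference in direction). `alpha` weights the
    two terms and is illustrative, not taken from the paper.
    """
    # Euclidean distance from z to every codeword (rows of `codebook`)
    eucl = np.linalg.norm(codebook - z, axis=1)
    # Cosine similarity, converted to a distance in [0, 2]
    cos_sim = (codebook @ z) / (
        np.linalg.norm(codebook, axis=1) * np.linalg.norm(z) + 1e-8
    )
    cos_dist = 1.0 - cos_sim
    # Unified score: small when vectors are both close and aligned
    score = alpha * eucl / (eucl.max() + 1e-8) + (1.0 - alpha) * cos_dist
    return int(np.argmin(score))
```

A plain Euclidean search can pick a codeword that is close in magnitude but points in a different direction; blending in the cosine term penalizes such directional mismatches during quantization.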
Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities (No. 2042023kf1033), the National Natural Science Foundation of China (Nos. 62276195 and 62262026), and the project of the Jiangxi Education Department (No. GJJ211111).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Xu, Q., Tu, W., Luo, Y., Zhou, X., Xiao, L., Zheng, Y. (2023). MBMS-GAN: Multi-Band Multi-Scale Adversarial Learning for Enhancement of Coded Speech at Very Low Rate. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14260. Springer, Cham. https://doi.org/10.1007/978-3-031-44195-0_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44194-3
Online ISBN: 978-3-031-44195-0
eBook Packages: Computer Science (R0)