Abstract
Speech coding aims to represent speech signals efficiently in digital form. Existing solutions usually suffer from large quantization error when coding at very low bit rates, which causes serious distortion of the spectral energy of the reconstructed speech; moreover, most current approaches ignore the fact that this distortion is often unevenly distributed across the spectrum. To address these issues, we propose a novel multi-band multi-scale generative adversarial network (MBMS-GAN) for speech coding. In particular, the speech codec is trained in an adversarial manner at different sub-bands and scales so that both global and local spectral energy distortion are taken into account. In addition, a unified codebook matching strategy is designed that integrates Euclidean distance and cosine similarity, thereby considering both the absolute distance between two vectors and their directions. We verify the effectiveness of our method on the popular CSTR-VCTK dataset, and the results demonstrate that it significantly improves the quality of speech reconstructed at 600 bps, by 0.19 in terms of MOS score. Our study has high application value in scenarios with narrow communication channels, such as satellite communication.
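The unified codebook matching strategy described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the weighting factor `alpha`, the distance normalization, and the function name `match_codebook` are assumptions introduced here for clarity.

```python
import numpy as np

def match_codebook(z, codebook, alpha=0.5):
    """Pick the index of the codeword closest to z under a combined metric.

    Integrates Euclidean distance (absolute difference between vectors)
    with cosine distance (difference in direction). `alpha` weights the
    two terms and is illustrative, not taken from the paper.
    """
    # Euclidean distance from z to every codeword (rows of `codebook`)
    eucl = np.linalg.norm(codebook - z, axis=1)
    # Cosine similarity, converted to a distance in [0, 2]
    cos_sim = (codebook @ z) / (
        np.linalg.norm(codebook, axis=1) * np.linalg.norm(z) + 1e-8
    )
    cos_dist = 1.0 - cos_sim
    # Unified score: small when vectors are both close and aligned
    score = alpha * eucl / (eucl.max() + 1e-8) + (1.0 - alpha) * cos_dist
    return int(np.argmin(score))
```

A plain Euclidean search can pick a codeword that is close in magnitude but points in a different direction; blending in the cosine term penalizes such directional mismatches during quantization.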
Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities (No. 2042023kf1033), the National Natural Science Foundation of China (Nos. 62276195 and 62262026), and the project of the Jiangxi Education Department (No. GJJ211111).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Xu, Q., Tu, W., Luo, Y., Zhou, X., Xiao, L., Zheng, Y. (2023). MBMS-GAN: Multi-Band Multi-Scale Adversarial Learning for Enhancement of Coded Speech at Very Low Rate. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14260. Springer, Cham. https://doi.org/10.1007/978-3-031-44195-0_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44194-3
Online ISBN: 978-3-031-44195-0
eBook Packages: Computer Science (R0)