Abstract
Speech intelligibility enhancement is a perceptual enhancement technique for clean speech reproduced in noisy environments. Many studies enhance speech intelligibility through speaking style conversion (SSC), but SSC that relies solely on the Lombard effect does not work well under strong noise interference. These studies also model the conversion of fundamental frequency (F0) with a straightforward linear transform and map only a few dimensions of Mel-cepstral coefficients (MCEPs). As F0 and MCEPs are critical aspects of hierarchical intonation, we believe that adequate modeling of these features is essential. In this paper, we use the continuous wavelet transform (CWT) to decompose F0 into ten temporal scales that describe speech at different time resolutions for effective F0 conversion, and we represent MCEPs with 20 dimensions, compared with the baseline's 10 dimensions, for MCEP conversion. We use an iMetricGAN network to optimize speech intelligibility metrics in strong noise. Experimental results show that the proposed Non-Parallel Speech Style Conversion using CWT and iMetricGAN based CycleGAN (NS-CiC) method outperforms the baselines and significantly increases speech intelligibility in strong noise environments in both objective and subjective evaluations.
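To make the ten-scale F0 decomposition concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a Mexican hat mother wavelet, dyadic scales of 2, 4, ..., 1024 frames, and a normalized log-F0 contour, none of which the abstract specifies. The helper names `mexican_hat` and `cwt_decompose` and the synthetic contour are hypothetical and serve only as illustration.

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet evaluated at t."""
    norm = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)
    return norm * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)

def cwt_decompose(contour, num_scales=10):
    """Return a (num_scales, len(contour)) array of CWT coefficients.

    Scale i (0-based) has width 2 ** (i + 1) frames; this dyadic spacing
    is an assumption, not a setting taken from the paper.
    """
    contour = np.asarray(contour, dtype=float)
    n = contour.size
    coeffs = np.empty((num_scales, n))
    for i in range(num_scales):
        scale = 2.0 ** (i + 1)
        half = int(np.ceil(5.0 * scale))      # wavelet is ~0 beyond 5 scale widths
        t = np.arange(-half, half + 1) / scale
        kernel = mexican_hat(t) / np.sqrt(scale)
        full = np.convolve(contour, kernel, mode="full")
        coeffs[i] = full[half:half + n]       # keep the part aligned with the input
    return coeffs

# Synthetic voiced F0 contour in Hz (illustrative data, not from the paper).
frames = np.arange(400)
f0 = (120.0
      + 20.0 * np.sin(2 * np.pi * frames / 200.0)   # slow, phrase-like movement
      + 5.0 * np.sin(2 * np.pi * frames / 15.0))    # fast, syllable-like movement
log_f0 = np.log(f0)
normalized = (log_f0 - log_f0.mean()) / log_f0.std()

scales = cwt_decompose(normalized, num_scales=10)
print(scales.shape)  # (10, 400)
```

Each row of `scales` tracks the contour at one time resolution: low rows respond to fast, local movements and high rows to slow, phrase-level trends. A conversion model could then map these rows instead of a single linearly transformed F0 value.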
This work was supported by the National Key Research and Development Program of China (1502-211100026).
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Xiao, J., Liu, J., Li, D., Zhao, L., Wang, Q. (2022). Speech Intelligibility Enhancement By Non-Parallel Speech Style Conversion Using CWT and iMetricGAN Based CycleGAN. In: Þór Jónsson, B., et al. MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13141. Springer, Cham. https://doi.org/10.1007/978-3-030-98358-1_43
DOI: https://doi.org/10.1007/978-3-030-98358-1_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-98357-4
Online ISBN: 978-3-030-98358-1
eBook Packages: Computer Science, Computer Science (R0)