Speech Intelligibility Enhancement By Non-Parallel Speech Style Conversion Using CWT and iMetricGAN Based CycleGAN

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13141)

Abstract

Speech intelligibility enhancement is a perceptual enhancement technique for clean speech reproduced in noisy environments. Many studies enhance speech intelligibility by speaking style conversion (SSC), which relies solely on the Lombard effect and does not work well under strong noise interference. These studies also model the conversion of fundamental frequency (F0) with a straightforward linear transform and map only a few dimensions of the Mel-cepstral coefficients (MCEPs). As F0 and MCEPs are critical aspects of hierarchical intonation, we believe that adequate modeling of these features is essential. In this paper, we use the continuous wavelet transform (CWT) to decompose F0 into ten temporal scales that describe speech at different time resolutions for effective F0 conversion, and we represent MCEPs with 20 dimensions instead of the baseline's 10 dimensions for MCEP conversion. We also use an iMetricGAN network to optimize speech intelligibility metrics in strong noise. Experimental results show that the proposed Non-Parallel Speech Style Conversion using CWT and iMetricGAN based CycleGAN (NS-CiC) method outperforms the baselines and significantly increases speech intelligibility in strong noise environments, in both objective and subjective evaluations.
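To make the ten-scale F0 decomposition concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not specify the wavelet, the scale spacing, or the preprocessing, so the Mexican hat mother wavelet, the dyadic scales, the linear interpolation over unvoiced frames, and the function names (`mexican_hat`, `interpolate_log_f0`, `cwt_f0`) are all assumptions chosen for illustration.

```python
import numpy as np


def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet with unit L2 norm (assumed choice)."""
    return (2.0 / (np.sqrt(3.0) * np.pi ** 0.25)) * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)


def interpolate_log_f0(f0):
    """Fill unvoiced frames (f0 <= 0) by linear interpolation, then take the log."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        raise ValueError("F0 contour has no voiced frames")
    idx = np.arange(len(f0))
    filled = np.interp(idx, idx[voiced], f0[voiced])
    return np.log(filled)


def cwt_f0(f0, num_scales=10, base_scale=1.0):
    """Decompose an F0 contour into `num_scales` temporal scales.

    Scales are dyadic (base_scale * 2**i frames, i = 1..num_scales); this
    spacing is an assumption, not taken from the paper.
    Returns (coeffs, scales) with coeffs of shape (num_scales, num_frames).
    """
    log_f0 = interpolate_log_f0(f0)
    std = log_f0.std()
    # Zero-mean, unit-variance contour so that scales are comparable.
    norm = (log_f0 - log_f0.mean()) / (std if std > 0 else 1.0)

    scales = base_scale * 2.0 ** np.arange(1, num_scales + 1)
    coeffs = np.empty((num_scales, len(norm)))
    for i, s in enumerate(scales):
        half = int(np.ceil(5 * s))             # kernel support of +/- 5 scale units
        t = np.arange(-half, half + 1) / s
        kernel = mexican_hat(t) / np.sqrt(s)   # s^(-1/2) scale normalization
        padded = np.pad(norm, half, mode="edge")
        # The kernel is symmetric, so convolution equals correlation here.
        coeffs[i] = np.convolve(padded, kernel, mode="valid")
    return coeffs, scales


if __name__ == "__main__":
    frames = np.arange(2049)
    # Synthetic contour: 200 Hz with a slow modulation of period 64 frames.
    contour = 200.0 * np.exp(0.1 * np.sin(2 * np.pi * frames / 64))
    coeffs, scales = cwt_f0(contour)
    print(coeffs.shape, scales)
```

Each row of `coeffs` describes the contour at one time resolution: small scales follow fast, local pitch movements, while large scales follow slow, phrase-level trends. In a conversion system, a separate mapping could be learned for this multi-scale representation instead of a single linear transform of F0.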

This work was supported by the National Key Research and Development Program of China (1502-211100026).

Author information

Corresponding author

Correspondence to Jing Xiao.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Xiao, J., Liu, J., Li, D., Zhao, L., Wang, Q. (2022). Speech Intelligibility Enhancement By Non-Parallel Speech Style Conversion Using CWT and iMetricGAN Based CycleGAN. In: Þór Jónsson, B., et al. MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13141. Springer, Cham. https://doi.org/10.1007/978-3-030-98358-1_43

  • DOI: https://doi.org/10.1007/978-3-030-98358-1_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-98357-4

  • Online ISBN: 978-3-030-98358-1

  • eBook Packages: Computer Science, Computer Science (R0)
