Speech Intelligibility Enhancement By Non-Parallel Speech Style Conversion Using CWT and iMetricGAN Based CycleGAN

Xiao, Jing; Liu, Jiaqi; Li, Dengshi; Zhao, Lanxin; Wang, Qianrui

doi:10.1007/978-3-030-98358-1_43

Conference paper
First Online: 15 March 2022

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13141)

Abstract

Speech intelligibility enhancement is a perceptual enhancement technique for clean speech reproduced in noisy environments. Many studies enhance speech intelligibility by speaking style conversion (SSC), but SSC that relies solely on the Lombard effect does not work well under strong noise interference. These methods also model the conversion of fundamental frequency (F0) with a simple linear transform and map only a few dimensions of Mel-cepstral coefficients (MCEPs). As F0 and MCEPs are critical aspects of hierarchical intonation, we believe that adequate modeling of these features is essential. In this paper, we use the continuous wavelet transform (CWT) to decompose F0 into ten temporal scales that describe speech at different time resolutions for effective F0 conversion, and we represent MCEPs with 20 dimensions instead of the baseline 10 dimensions for MCEP conversion. We also use an iMetricGAN network to optimize speech intelligibility metrics in strong noise. Experimental results show that the proposed Non-Parallel Speech Style Conversion using CWT and iMetricGAN based CycleGAN (NS-CiC) method outperforms the baselines, significantly increasing speech intelligibility in strong noise environments in both objective and subjective evaluations.
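
The ten-scale F0 decomposition described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify the mother wavelet, the scale spacing, or the F0 preprocessing, so everything below is an assumption made for illustration: a Mexican hat wavelet, scales one octave apart, and a log-F0 contour whose unvoiced gaps have already been interpolated. The names mexican_hat, cwt_f0 and base_scale are hypothetical and do not come from the paper.

```python
import numpy as np


def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet with unit L2 norm."""
    norm = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)
    return norm * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)


def cwt_f0(log_f0, num_scales=10, base_scale=1.0):
    """Decompose a continuous log-F0 contour into num_scales temporal scales.

    log_f0     : 1-D array, one value per frame, unvoiced gaps already filled
    base_scale : assumed finest scale unit in frames; scale i is
                 base_scale * 2**(i + 1), i.e. the scales are one octave apart
    Returns an array of shape (num_scales, len(log_f0)).
    """
    x = np.asarray(log_f0, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-8)  # zero mean, unit variance
    coeffs = np.empty((num_scales, x.size))
    for i in range(num_scales):
        scale = base_scale * 2.0 ** (i + 1)
        half = int(np.ceil(5.0 * scale))  # truncate the wavelet at +/- 5 scale units
        kernel = mexican_hat(np.arange(-half, half + 1) / scale) / np.sqrt(scale)
        padded = np.pad(x, half, mode="edge")  # repeat edge values at the boundaries
        # The wavelet is symmetric, so convolution equals correlation here.
        coeffs[i] = np.convolve(padded, kernel, mode="valid")
    return coeffs
```

Under these assumptions, a contour sampled every 5 ms would be analysed at scales from 10 ms up to about 5 s, so the low rows of the output follow fast, phone-level F0 movement and the high rows follow slow, phrase-level trends. Each row could then be converted separately, which a single linear transform of F0 cannot do.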

This work was supported by the National Key Research and Development Program of China (1502-211100026).

Author information

Authors and Affiliations

National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, 430072, China
Jing Xiao & Jiaqi Liu
Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan, 430072, China
Jing Xiao & Jiaqi Liu
School of Artificial Intelligence, Jianghan University, Wuhan, 430056, China
Dengshi Li, Lanxin Zhao & Qianrui Wang

Corresponding author

Correspondence to Jing Xiao.

Editor information

Editors and Affiliations

IT University of Copenhagen, Copenhagen, Denmark
Björn Þór Jónsson
Dublin City University, Dublin, Ireland
Cathal Gurrin
University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
Minh-Triet Tran
University of Bergen, Bergen, Norway
Duc-Tien Dang-Nguyen
National Tsing Hua University, Hsinchu, Taiwan
Anita Min-Chun Hu
Hanoi University of Science and Technology, Hanoi, Vietnam
Binh Huynh Thi Thanh
Median Technologies, Valbonne, France
Benoit Huet

About this paper

Cite this paper

Xiao, J., Liu, J., Li, D., Zhao, L., Wang, Q. (2022). Speech Intelligibility Enhancement By Non-Parallel Speech Style Conversion Using CWT and iMetricGAN Based CycleGAN. In: Þór Jónsson, B., et al. MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13141. Springer, Cham. https://doi.org/10.1007/978-3-030-98358-1_43

DOI: https://doi.org/10.1007/978-3-030-98358-1_43
Published: 15 March 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-98357-4
Online ISBN: 978-3-030-98358-1
eBook Packages: Computer Science, Computer Science (R0)
