
A Novel Attention-Guided Generative Adversarial Network for Whisper-to-Normal Speech Conversion

Abstract

Whispered speech is a special voicing style, often employed in public settings to keep the content of speech private. It is also the primary mode of oral communication for aphonic individuals who have undergone laryngectomy. Converting whispered speech to normal-voiced speech can significantly improve speech quality and/or intelligibility for whisper perception and recognition. Because of the pronounced difference in voicing style between normal and whispered speech, estimating normal-voiced speech from its whispered counterpart remains a major challenge. Existing whisper-to-normal speech conversion methods learn a nonlinear mapping between features of whispered speech and those of its normal counterpart, and the converted normal speech is reconstructed from features selected by the learned mapping from the training data space. Such methods can produce discontinuous spectra across successive frames, degrading the quality and/or intelligibility of the converted speech. This paper proposes a novel generative model (AGAN-W2SC) for whisper-to-normal speech conversion. Unlike feature-mapping models, the proposed AGAN-W2SC model generates a normal speech spectrum directly from a whispered spectrum. To make the generated spectrum more similar to the reference normal speech, the model captures both the inner-feature coherence within a whispered spectrum and the inter-feature coherence between whispered speech and its normal counterpart. Specifically, a self-attention mechanism is introduced to capture the inner-spectrum structure, while a Siamese neural network is adopted to capture the inter-spectrum structure across the two domains. Additionally, the proposed model adopts identity mapping to preserve linguistic information. AGAN-W2SC requires no parallel data and can be trained at the frame level. Experimental results on whisper-to-normal speech conversion demonstrate that the proposed method outperforms all compared competing methods in terms of speech quality and intelligibility.
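To make the architectural ideas in the abstract concrete, below is a minimal PyTorch sketch of the three ingredients it names: frame-level self-attention over a spectrogram, a Siamese coherence loss between whisper-domain and normal-domain embeddings, and an identity-mapping term. This is an illustrative sketch under stated assumptions, not the authors' implementation; the module and function names (SelfAttention1d, siamese_coherence_loss, identity_mapping_loss) and all shapes and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention1d(nn.Module):
    """SAGAN-style self-attention over the time axis of a spectrogram.

    Input shape: (batch, channels, frames). Every frame attends to every
    other frame, which is one way to model the "inner-spectrum" coherence
    the abstract describes. Hypothetical sketch, not the paper's code.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv1d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv1d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv1d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.query(x).transpose(1, 2)          # (B, T, C//8)
        k = self.key(x)                            # (B, C//8, T)
        attn = F.softmax(torch.bmm(q, k), dim=-1)  # (B, T, T) frame-to-frame weights
        v = self.value(x)                          # (B, C, T)
        out = torch.bmm(v, attn.transpose(1, 2))   # re-weight frames by attention
        return self.gamma * out + x                # residual connection


def siamese_coherence_loss(emb_whisper: torch.Tensor,
                           emb_normal: torch.Tensor) -> torch.Tensor:
    """Keep pairwise relations between samples consistent across domains
    (in the spirit of TraVeLGAN's Siamese transformation vectors), one way
    to model the "inter-spectrum" coherence the abstract describes."""
    vw = emb_whisper.unsqueeze(0) - emb_whisper.unsqueeze(1)  # (B, B, D) pairwise vectors
    vn = emb_normal.unsqueeze(0) - emb_normal.unsqueeze(1)    # (B, B, D) pairwise vectors
    return F.mse_loss(vw, vn)


def identity_mapping_loss(generator: nn.Module,
                          normal_spec: torch.Tensor) -> torch.Tensor:
    """Identity-mapping term: passing an already-normal spectrum through the
    whisper-to-normal generator should change it as little as possible,
    which encourages preservation of linguistic content."""
    return F.l1_loss(generator(normal_spec), normal_spec)
```

In a full training loop these terms would be weighted and added to a standard adversarial loss, e.g. loss = adv_loss + lambda_s * siamese_coherence_loss(e_w, e_n) + lambda_id * identity_mapping_loss(G, x_n), where lambda_s and lambda_id are hypothetical placeholder weights.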

Data Availability

Some or all data, models, or code generated or used during the study are available from the corresponding author upon request.

Acknowledgements

The authors thank the anonymous reviewers for their comments.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grants 61301295 and 61372137, the Anhui Provincial Natural Science Foundation under Grant 1908085MF209, and the Anhui University Natural Science Research Project under Grant KJ2018A0018.

Author information

Corresponding author

Correspondence to Jian Zhou.

Ethics declarations

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Conflict of Interest

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gao, T., Pan, Q., Zhou, J. et al. A Novel Attention-Guided Generative Adversarial Network for Whisper-to-Normal Speech Conversion. Cogn Comput 15, 778–792 (2023). https://doi.org/10.1007/s12559-023-10108-9
