
A Novel Attention-Guided Generative Adversarial Network for Whisper-to-Normal Speech Conversion

Abstract

Whispered speech is a special voicing style, often employed in public settings to keep the content of speech private. It is also the primary mode of oral communication for aphonic individuals who have undergone laryngectomy. Converting whispered speech to normal-voiced speech can significantly improve speech quality and/or intelligibility for whisper perception and recognition. Because of the pronounced difference in voicing style between normal and whispered speech, estimating normal-voiced speech from its whispered counterpart remains a major challenge. Existing whisper-to-normal speech conversion methods learn a nonlinear mapping between features of whispered speech and those of its normal counterpart, and the converted normal speech is reconstructed from features selected by the learned mapping from the training data space. Such methods can produce discontinuous spectra across successive frames, degrading the quality and/or intelligibility of the converted speech. This paper proposes a novel generative model (AGAN-W2SC) for whisper-to-normal speech conversion. Unlike feature-mapping models, the proposed AGAN-W2SC model generates a normal speech spectrum directly from a whispered spectrum. To make the generated spectrum more similar to the reference normal speech, the model captures both the inner-feature coherence within a whispered spectrum and the inter-feature coherence between whispered speech and its normal counterpart. Specifically, a self-attention mechanism is introduced to capture the inner-spectrum structure, while a Siamese neural network is adopted to capture the inter-spectrum structure across the two domains. Additionally, the proposed model adopts identity mapping to preserve linguistic information. AGAN-W2SC requires no parallel data and can be trained at the frame level. Experimental results on whisper-to-normal speech conversion demonstrate that the proposed method outperforms all compared competing methods in terms of speech quality and intelligibility.
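To make the architectural ideas in the abstract concrete, below is a minimal PyTorch sketch of the three ingredients it names: frame-level self-attention over a spectrogram, a Siamese coherence loss between whisper-domain and normal-domain embeddings, and an identity-mapping term. This is an illustrative sketch under stated assumptions, not the authors' implementation; the module and function names (SelfAttention1d, siamese_coherence_loss, identity_mapping_loss) and all shapes and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention1d(nn.Module):
    """SAGAN-style self-attention over the time axis of a spectrogram.

    Input shape: (batch, channels, frames). Every frame attends to every
    other frame, which is one way to model the "inner-spectrum" coherence
    the abstract describes. Hypothetical sketch, not the paper's code.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv1d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv1d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv1d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.query(x).transpose(1, 2)          # (B, T, C//8)
        k = self.key(x)                            # (B, C//8, T)
        attn = F.softmax(torch.bmm(q, k), dim=-1)  # (B, T, T) frame-to-frame weights
        v = self.value(x)                          # (B, C, T)
        out = torch.bmm(v, attn.transpose(1, 2))   # re-weight frames by attention
        return self.gamma * out + x                # residual connection


def siamese_coherence_loss(emb_whisper: torch.Tensor,
                           emb_normal: torch.Tensor) -> torch.Tensor:
    """Keep pairwise relations between samples consistent across domains
    (in the spirit of TraVeLGAN's Siamese transformation vectors), one way
    to model the "inter-spectrum" coherence the abstract describes."""
    vw = emb_whisper.unsqueeze(0) - emb_whisper.unsqueeze(1)  # (B, B, D) pairwise vectors
    vn = emb_normal.unsqueeze(0) - emb_normal.unsqueeze(1)    # (B, B, D) pairwise vectors
    return F.mse_loss(vw, vn)


def identity_mapping_loss(generator: nn.Module,
                          normal_spec: torch.Tensor) -> torch.Tensor:
    """Identity-mapping term: passing an already-normal spectrum through the
    whisper-to-normal generator should change it as little as possible,
    which encourages preservation of linguistic content."""
    return F.l1_loss(generator(normal_spec), normal_spec)
```

In a full training loop these terms would be weighted and added to a standard adversarial loss, e.g. loss = adv_loss + lambda_s * siamese_coherence_loss(e_w, e_n) + lambda_id * identity_mapping_loss(G, x_n), where lambda_s and lambda_id are hypothetical placeholder weights.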

Data Availability

Some or all data, models, or code generated or used during the study are available from the corresponding author upon request.

Acknowledgements

The authors thank the anonymous reviewers for their comments.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grants 61301295 and 61372137, the Anhui Provincial Natural Science Foundation under Grant 1908085MF209, and the Anhui University Natural Science Research Project under Grant KJ2018A0018.

Author information

Corresponding author

Correspondence to Jian Zhou.

Ethics declarations

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Conflict of Interest

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gao, T., Pan, Q., Zhou, J. et al. A Novel Attention-Guided Generative Adversarial Network for Whisper-to-Normal Speech Conversion. Cogn Comput 15, 778–792 (2023). https://doi.org/10.1007/s12559-023-10108-9
