Optimal Near-End Speech Intelligibility Improvement Using CLPSO-Based Voice Transformation in Realistic Noisy Environments

Biswas, Ritujoy; Nathwani, Karan

doi:10.1007/s00034-022-02106-3

Published: 25 July 2022

Volume 41, pages 6999–7034 (2022)
Circuits, Systems, and Signal Processing

Abstract

This work aims to improve the near-end intelligibility of speech at very low signal-to-noise ratios (SNRs). Unlike existing intelligibility improvement methods, the proposed approach does not require prior knowledge of the noise statistics. To this end, the shaping parameters of the voice transformation function (VTF) are optimized within a comprehensive learning particle swarm optimization (CLPSO) framework. Optimizing these parameters corresponds to a combined modification comprising formant shifting, nonuniform time scaling, smoothing, and energy redistribution. The optimal parameters of the combined modification are obtained by jointly maximizing the short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ), and signal-to-distortion ratio (SDR) metrics, which together serve as the cost function in CLPSO. The resulting improvement in intelligibility is significantly higher than that obtained by applying these modifications individually, while speech quality is preserved. As a side result, Gaussian process regression is employed to estimate the shaping parameters of the VTF at arbitrary SNRs other than those used during CLPSO training.
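The CLPSO search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (their simulations were run in MATLAB with custom code). The function `toy_score` is a hypothetical stand-in for the joint STOI/PESQ/SDR objective, whose exact combination the abstract does not give. The three-dimensional parameter vector, the bounds, and the hyperparameters (`w_start`, `w_end`, `c`, `refresh_gap`) are illustrative assumptions. The update rule follows the general CLPSO idea of Liang et al.: each dimension of a particle learns from the personal best of either the particle itself or a tournament-selected peer.

```python
import math
import random


def toy_score(params):
    """Hypothetical stand-in for a joint STOI/PESQ/SDR score (higher is better)."""
    target = [0.3, -0.5, 0.2]  # assumed optimum, for illustration only
    return -sum((p - t) ** 2 for p, t in zip(params, target))


def clpso_maximize(score, dim, bounds, n_particles=20, iters=200, seed=0,
                   w_start=0.9, w_end=0.4, c=1.49445, refresh_gap=7):
    """Maximize `score` over a box with a basic CLPSO loop (needs n_particles >= 3)."""
    rng = random.Random(seed)
    lo, hi = bounds
    v_max = 0.2 * (hi - lo)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[rng.uniform(-v_max, v_max) for _ in range(dim)] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [score(p) for p in pos]
    # Per-particle learning probability, rising from 0.05 to 0.5 across the swarm.
    pc = [0.05 + 0.45 * (math.exp(10 * i / (n_particles - 1)) - 1) / (math.exp(10) - 1)
          for i in range(n_particles)]

    def pick_exemplars(i):
        """For each dimension, choose whose personal best particle i learns from."""
        others = [k for k in range(n_particles) if k != i]
        ex = []
        for _ in range(dim):
            if rng.random() < pc[i]:
                a, b = rng.sample(others, 2)  # tournament between two peers
                ex.append(a if pbest_val[a] >= pbest_val[b] else b)
            else:
                ex.append(i)
        if all(e == i for e in ex):  # force at least one dimension to learn from a peer
            ex[rng.randrange(dim)] = rng.choice(others)
        return ex

    exemplars = [pick_exemplars(i) for i in range(n_particles)]
    stall = [0] * n_particles
    for t in range(iters):
        w = w_start - (w_start - w_end) * t / max(1, iters - 1)  # linearly decaying inertia
        for i in range(n_particles):
            if stall[i] >= refresh_gap:  # re-draw exemplars after stagnation
                exemplars[i] = pick_exemplars(i)
                stall[i] = 0
            for d in range(dim):
                guide = pbest[exemplars[i][d]][d]
                v = w * vel[i][d] + c * rng.random() * (guide - pos[i][d])
                vel[i][d] = max(-v_max, min(v_max, v))
                pos[i][d] += vel[i][d]
            # Only positions inside the search box are evaluated.
            if all(lo <= x <= hi for x in pos[i]):
                val = score(pos[i])
                if val > pbest_val[i]:
                    pbest[i], pbest_val[i] = pos[i][:], val
                    stall[i] = 0
                    continue
            stall[i] += 1
    best = max(range(n_particles), key=lambda k: pbest_val[k])
    return pbest[best], pbest_val[best]


best_params, best_val = clpso_maximize(toy_score, dim=3, bounds=(-1.0, 1.0))
print(best_params, best_val)
```

In a real setting, `toy_score` would be replaced by a routine that applies the VTF with the candidate shaping parameters, mixes the result with noise at the chosen SNR, and returns the combined objective score. Because every dimension can follow a different exemplar, the swarm is less likely to collapse prematurely onto one global best than in standard PSO.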

Availability of Data and Materials

The dataset used in the current work is the CHAINS (CHAracterizing INdividual Speakers) dataset [2].

References

1. G. Biagetti et al., Speaker identification in noisy conditions using short sequences of speech frames, in Intelligent Decision Technologies 2017 (2018), pp. 43–52. ISBN: 978-3-319-59423-1. https://doi.org/10.1007/978-3-319-59424-8_5
2. F. Cummins et al., The CHAINS corpus: characterizing individual speakers, in SPECOM, vol. 6, SPC RAS (2006), pp. 431–435
3. Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 32(6), 1109–1121 (1984). https://doi.org/10.1109/TASSP.1984.1164453
4. F. Farias, R. Coelho, Blind adaptive mask to improve intelligibility of non-stationary noisy speech. IEEE Signal Process. Lett. 28, 1170–1174 (2021). https://doi.org/10.1109/LSP.2021.3086405
5. E. Fonseca et al., Freesound datasets: a platform for the creation of open audio datasets, in Proceedings of the 18th ISMIR Conference, Suzhou, China (2017), pp. 486–493
6. J.D. Griffiths, Optimum linear filter for speech transmission. J. Acoust. Soc. Am. 43(1), 81–86 (1968). https://doi.org/10.1121/1.1910768
7. R. Hendriks et al., Optimal near-end speech intelligibility improvement incorporating additive noise and late reverberation under an approximation of the short-time SII. IEEE/ACM Trans. Audio Speech Lang. Process. 23, 851–862 (2015). https://doi.org/10.1109/TASLP.2015.2409780
8. Y. Hu, P. Loizou, Subjective comparison of speech enhancement algorithms, in 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, vol. 1 (2006), pp. I–I. https://doi.org/10.1109/ICASSP.2006.1659980
9. Y. Jiang, H. Zhou, Z. Feng, Performance analysis of ideal binary masks in speech enhancement, in 4th International Congress on Image and Signal Processing, vol. 5 (IEEE, 2011), pp. 2422–2425. https://doi.org/10.1109/CISP.2011.6100732
10. J.J. Liang et al., Comprehensive learning particle swarm optimizer for global optimization of multimodal functions. IEEE Trans. Evol. Comput. 10(3), 281–295 (2006). https://doi.org/10.1109/TEVC.2005.857610
11. R. Martin, Spectral subtraction based on minimum statistics. Power 6(8), 1182–1185 (1994)
12. N. McLaughlin, J. Ming, D. Crookes, Speaker recognition in noisy conditions with limited training data, in 2011 19th European Signal Processing Conference (2011), pp. 1294–1298
13. J. Ming et al., Robust speaker recognition in noisy conditions. IEEE Trans. Audio Speech Lang. Process. (2007). https://doi.org/10.1109/TASL.2007.899278
14. K. Nathwani, Intelligibility improvement using Kalman filtering & EM approach in formant shifting framework, in International Symposium on Signal Processing and Information Technology (ISSPIT) (IEEE, 2019), pp. 1–6. https://doi.org/10.1109/ISSPIT47144.2019.9001849
15. K. Nathwani, P. Pandit, R.M. Hegde, Group delay based methods for speaker segregation and its application in multimedia information retrieval. IEEE Trans. Multimed. 15(6), 1326–1339 (2013)
16. K. Nathwani et al., Formant shifting for speech intelligibility improvement in car noise environment, in International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2016), pp. 5375–5379. https://doi.org/10.1109/ICASSP.2016.7472704
17. K. Nathwani et al., Speech intelligibility improvement in car noise environment by voice transformation. Speech Commun. 91, 17–27 (2017). https://doi.org/10.1016/j.specom.2017.04.007
18. R.J. Niederjohn, J.H. Grotelueschen, The enhancement of speech intelligibility in high noise levels by high-pass filtering followed by rapid amplitude compression. IEEE Trans. Acoust. Speech Signal Process. 24(4), 277–282 (1976). https://doi.org/10.1109/TASSP.1976.1162824
19. R. Patel et al., Nonlinear excitation control of diesel generator: a command filter backstepping approach. IEEE Trans. Ind. Inform. (2020). https://doi.org/10.1109/TII.2020.3017744
20. L. Rabiner, R. Schafer, Theory and Applications of Digital Speech Processing (Prentice Hall Press, Hoboken, 2010)
21. M. Rahmati, R. Effatnejad, A. Safari, Comprehensive learning particle swarm optimization (CLPSO) for multi-objective optimal power flow. Indian J. Sci. Technol. 7(3), 262–270 (2014). https://doi.org/10.17485/ijst/2014/v7i3.7
22. C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning (The MIT Press, Cambridge, 2006)
23. A.W. Rix et al., Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2001), pp. 749–752
24. M. Song et al., A time-weighted method for predicting the intelligibility of speech in the presence of interfering sounds, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018), pp. 5589–5593. https://doi.org/10.1109/ICASSP.2018.8462124
25. T. Sreenivas, P. Kirnapure, Codebook constrained Wiener filtering for speech enhancement. IEEE Trans. Speech Audio Process. 4(5), 383–389 (1996). https://doi.org/10.1109/89.536932
26. C. Taal, R. Hendriks, R. Heusdens, Speech energy redistribution for intelligibility improvement in noise based on a perceptual distortion measure. Comput. Speech Lang. 28(4), 858–872 (2014). https://doi.org/10.1016/j.csl.2013.11.003
27. C. Taal, J. Jensen, SII-based speech preprocessing for intelligibility improvement in noise, in Annual Conference of the International Speech Communication Association, INTERSPEECH (2013), pp. 3582–3586
28. C.H. Taal et al., A short-time objective intelligibility measure for time-frequency weighted noisy speech, in International Conference on Acoustics, Speech and Signal Processing (IEEE, 2010), pp. 4214–4217. https://doi.org/10.1109/ICASSP.2010.5495701
29. Y. Tang, Background adaptation for improved listening experience in broadcasting, in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2019), pp. 8008–8012
30. H. Valbret, E. Moulines, J.-P. Tubach, Voice transformation using PSOLA technique. Speech Commun. 11(2–3), 175–187 (1992)
31. E. Vincent et al., The second ‘CHiME’ speech separation and recognition challenge: datasets, tasks and baselines, in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013), pp. 126–130. https://doi.org/10.1109/ICASSP.2013.6637622
32. D. Wang, On ideal binary mask as the computational goal of auditory scene analysis, in Speech Separation by Humans and Machines (Springer, 2005), pp. 181–197
33. B. Xia, C. Bao, Wiener filtering based speech enhancement with weighted denoising auto-encoder and noise classification. Speech Commun. 60, 13–29 (2014). https://doi.org/10.1016/j.specom.2014.02.001
34. K. Yamamoto et al., Predicting speech intelligibility using a Gammachirp envelope distortion index based on the signal-to-distortion ratio, in INTERSPEECH (2017), pp. 2949–2953
35. S. Zahorian, H. Hu, A spectral/temporal method for robust fundamental frequency tracking. J. Acoust. Soc. Am. 123, 4559–4571 (2008). https://doi.org/10.1121/1.2916590
36. A. Zehtabian et al., A novel speech enhancement approach based on singular value decomposition and genetic algorithm, in IEEE International Conference of Soft Computing and Pattern Recognition (IEEE, 2010), pp. 430–435. https://doi.org/10.1109/SOCPAR.2010.5686627

Funding

This work is supported through Project No. CRG/2018/003920 under the CRG-SERB Scheme.

Author information

Authors and Affiliations

Department of Electrical Engineering, Indian Institute of Technology Jammu, Jammu and Kashmir, India
Ritujoy Biswas & Karan Nathwani

Contributions

The authors are the sole contributors to this work. The simulations were carried out in MATLAB, and the pertinent results are presented in this work. The work was supervised by Dr Karan Nathwani.

Corresponding author

Correspondence to Ritujoy Biswas.

Ethics declarations

Conflict of interests

The authors declare that there is no conflict of interest.

Code availability

Custom code.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Biswas, R., Nathwani, K. Optimal Near-End Speech Intelligibility Improvement Using CLPSO-Based Voice Transformation in Realistic Noisy Environments. Circuits Syst Signal Process 41, 6999–7034 (2022). https://doi.org/10.1007/s00034-022-02106-3

Received: 25 July 2021
Revised: 28 June 2022
Accepted: 30 June 2022
Published: 25 July 2022
Issue Date: December 2022
DOI: https://doi.org/10.1007/s00034-022-02106-3
