Optimal Near-End Speech Intelligibility Improvement Using CLPSO-Based Voice Transformation in Realistic Noisy Environments

Circuits, Systems, and Signal Processing

Abstract

This work aims to improve the near-end intelligibility of speech at very low signal-to-noise ratios (SNRs). Unlike existing intelligibility improvement methods, the proposed approach does not require prior knowledge of the noise statistics. To this end, the shaping parameters of the voice transformation function (VTF) are optimized within a comprehensive learning particle swarm optimization (CLPSO) framework. These parameters govern a combined modification comprising formant shifting, nonuniform time scaling, smoothing, and energy redistribution. The optimal parameters are obtained by jointly maximizing three metrics, which together serve as the CLPSO cost function: short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ), and signal-to-distortion ratio (SDR). The resulting improvement in intelligibility is significantly higher than that obtained by applying the individual modifications separately, and speech quality is preserved. In addition, Gaussian process regression is employed to estimate the shaping parameters of the VTF at arbitrary SNRs other than those used during CLPSO training.
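To make the optimization step more concrete, the following is a minimal Python sketch of a simplified CLPSO loop that maximizes a scalar score over a bounded parameter vector. It is an illustration under stated assumptions, not the authors' implementation (which, per the Contributions section, was written in MATLAB). The function names `clpso_maximize` and `toy_score`, the four-dimensional parameter vector, the bounds, and all constants are hypothetical. In particular, `toy_score` is only a placeholder for the combined short-time objective intelligibility, perceptual evaluation of speech quality, and signal-to-distortion ratio score, whose exact form is not given in the abstract. The sketch also clamps positions to the bounds and allows a particle to draw itself in the tournament, both of which are simplifications.

```python
import numpy as np


def clpso_maximize(score, lower, upper, n_particles=20, n_iters=100,
                   refresh_gap=5, c=1.5, seed=0):
    """Maximize score(x) inside box bounds with a simplified CLPSO loop."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size
    span = upper - lower
    v_max = 0.2 * span  # assumed velocity limit per dimension

    x = lower + rng.random((n_particles, dim)) * span  # initial positions
    v = np.zeros((n_particles, dim))
    pbest = x.copy()
    pbest_score = np.array([score(p) for p in pbest])

    # Learning probability per particle; the constants are assumptions
    # following the schedule commonly used with CLPSO.
    t = np.linspace(0.0, 10.0, n_particles)
    pc = 0.05 + 0.45 * (np.exp(t) - 1.0) / (np.exp(10.0) - 1.0)

    def pick_exemplars(i):
        # For each dimension, decide whose personal best particle i follows.
        exemplar = np.full(dim, i)
        for d in range(dim):
            if rng.random() < pc[i]:
                a, b = rng.choice(n_particles, size=2, replace=False)
                exemplar[d] = a if pbest_score[a] > pbest_score[b] else b
        if np.all(exemplar == i):
            # Force at least one dimension to learn from another particle.
            others = [j for j in range(n_particles) if j != i]
            exemplar[rng.integers(dim)] = rng.choice(others)
        return exemplar

    exemplars = np.array([pick_exemplars(i) for i in range(n_particles)])
    stall = np.zeros(n_particles, dtype=int)

    for it in range(n_iters):
        w = 0.9 - 0.5 * it / max(n_iters - 1, 1)  # inertia decays 0.9 -> 0.4
        for i in range(n_particles):
            if stall[i] >= refresh_gap:
                exemplars[i] = pick_exemplars(i)
                stall[i] = 0
            # Dimension-wise target assembled from the chosen personal bests.
            target = pbest[exemplars[i], np.arange(dim)]
            v[i] = w * v[i] + c * rng.random(dim) * (target - x[i])
            v[i] = np.clip(v[i], -v_max, v_max)
            x[i] = np.clip(x[i] + v[i], lower, upper)  # clamping: a simplification
            s = score(x[i])
            if s > pbest_score[i]:
                pbest[i], pbest_score[i] = x[i].copy(), s
                stall[i] = 0
            else:
                stall[i] += 1

    best = int(np.argmax(pbest_score))
    return pbest[best].copy(), float(pbest_score[best])


# Hypothetical stand-in for the combined intelligibility/quality score:
# a smooth function whose maximum (0.0) lies at an arbitrary known point.
TOY_OPTIMUM = np.array([0.3, -0.2, 0.5, 0.1])


def toy_score(params):
    return -float(np.sum((np.asarray(params) - TOY_OPTIMUM) ** 2))


best_params, best_score = clpso_maximize(toy_score, lower=[-1.0] * 4, upper=[1.0] * 4)
print(best_params, best_score)
```

In a real pipeline, `toy_score` would be replaced by a function that applies the voice transformation with the candidate shaping parameters, mixes the result with noise at the target SNR, and returns the combined metric value. Because each particle learns different dimensions from different particles' personal bests rather than from a single global best, the search is less prone to collapsing early onto one candidate.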

Availability of Data and Materials

The dataset used in the current work is the CHAINS (CHAracterizing INdividual Speakers) dataset [2].

Notes

  1. https://chains.ucd.ie/index.php.

  2. https://tinyurl.com/yrk4fwuy.

References

  1. G. Biagetti et al., Speaker identification in noisy conditions using short sequences of speech frames, in Intelligent Decision Technologies 2017 (2018), pp. 43–52. ISBN: 978-3-319-59423-1. https://doi.org/10.1007/978-3-319-59424-8_5

  2. F. Cummins et al., The CHAINS corpus: characterizing individual speakers, in SPECOM, vol. 6, SPC RAS. (2006), pp. 431–435

  3. Y. Ephraim, D. Malah, Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 32(6), 1109–1121 (1984). https://doi.org/10.1109/TASSP.1984.1164453

  4. F. Farias, R. Coelho, Blind adaptive mask to improve intelligibility of non-stationary noisy speech. IEEE Signal Process. Lett. 28, 1170–1174 (2021). https://doi.org/10.1109/LSP.2021.3086405

  5. E. Fonseca et al., Freesound datasets: a platform for the creation of open audio datasets, in Proceedings of the 18th ISMIR Conference, Suzhou, China (2017), pp. 486–493

  6. J.D. Griffiths, Optimum linear filter for speech transmission. J. Acoust. Soc. Am. 43(1), 81–86 (1968). https://doi.org/10.1121/1.1910768

  7. R. Hendriks et al., Optimal near-end speech intelligibility improvement incorporating additive noise and late reverberation under an approximation of the short-time SII. IEEE/ACM Trans. Audio Speech Lang. Process. 23, 851–862 (2015). https://doi.org/10.1109/TASLP.2015.2409780

  8. Y. Hu, P. Loizou, Subjective comparison of speech enhancement algorithms, in 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, vol. 1 (2006), pp. I–I. https://doi.org/10.1109/ICASSP.2006.1659980

  9. Y. Jiang, H. Zhou, Z. Feng, Performance analysis of ideal binary masks in speech enhancement, in 4th International Congress on Image and Signal Processing, vol. 5 (IEEE, 2011), pp. 2422–2425. https://doi.org/10.1109/CISP.2011.6100732

  10. J.J. Liang et al., Comprehensive learning particle swarm optimizer for global optimization of multimodal functions. IEEE Trans. Evol. Comput. 10(3), 281–295 (2006). https://doi.org/10.1109/TEVC.2005.857610

  11. R. Martin, Spectral subtraction based on minimum statistics. Power 6(8), 1182–1185 (1994)

  12. N. McLaughlin, J. Ming, D. Crookes, Speaker recognition in noisy conditions with limited training data, in 2011 19th European Signal Processing Conference (2011), pp. 1294–1298

  13. J. Ming et al., Robust speaker recognition in noisy conditions. IEEE Trans. Audio Speech Lang. Process. (2007). https://doi.org/10.1109/TASL.2007.899278

  14. K. Nathwani, Intelligibility improvement using Kalman filtering & EM approach in formant shifting framework, in International Symposium on Signal Processing and Information Technology (ISSPIT) (IEEE, 2019), pp. 1–6. https://doi.org/10.1109/ISSPIT47144.2019.9001849

  15. K. Nathwani, P. Pandit, R.M. Hegde, Group delay based methods for speaker segregation and its application in multimedia information retrieval. IEEE Trans. Multimed. 15(6), 1326–1339 (2013)

  16. K. Nathwani et al., Formant shifting for speech intelligibility improvement in car noise environment, in International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2016), pp. 5375–5379. https://doi.org/10.1109/ICASSP.2016.7472704

  17. K. Nathwani et al., Speech intelligibility improvement in car noise environment by voice transformation. Speech Commun. 91, 17–27 (2017). https://doi.org/10.1016/j.specom.2017.04.007

  18. R.J. Niederjohn, J.H. Grotelueschen, The enhancement of speech intelligibility in high noise levels by high-pass filtering followed by rapid amplitude compression. IEEE Trans. Acoust. Speech Signal Process. 24(4), 277–282 (1976). https://doi.org/10.1109/TASSP.1976.1162824

  19. R. Patel et al., Nonlinear excitation control of diesel generator: a command filter backstepping approach. IEEE Trans. Ind. Inform. (2020). https://doi.org/10.1109/TII.2020.3017744

  20. L. Rabiner, R. Schafer, Theory and Applications of Digital Speech Processing (Prentice Hall Press, Hoboken, 2010)

  21. M. Rahmati, R. Effatnejad, A. Safari, Comprehensive learning particle swarm optimization (CLPSO) for multi-objective optimal power flow. Indian J. Sci. Technol. 7(3), 262–270 (2014). https://doi.org/10.17485/ijst/2014/v7i3.7

  22. C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning (The MIT Press, Cambridge, 2006)

  23. A.W. Rix et al., Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2001), pp. 749–752

  24. M. Song et al., A time-weighted method for predicting the intelligibility of speech in the presence of interfering sounds, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018), pp. 5589–5593. https://doi.org/10.1109/ICASSP.2018.8462124

  25. T. Sreenivas, P. Kirnapure, Codebook constrained Wiener filtering for speech enhancement. IEEE Trans. Speech Audio Process. 4(5), 383–389 (1996). https://doi.org/10.1109/89.536932

  26. C. Taal, R. Hendriks, R. Heusdens, Speech energy redistribution for intelligibility improvement in noise based on a perceptual distortion measure. Comput. Speech Lang. 28(4), 858–872 (2014). https://doi.org/10.1016/j.csl.2013.11.003

  27. C. Taal, J. Jensen, SII-based speech preprocessing for intelligibility improvement in noise, in Annual Conference of the International Speech Communication Association. INTERSPEECH (2013), pp. 3582–3586

  28. C.H. Taal et al., A short-time objective intelligibility measure for time-frequency weighted noisy speech, in International Conference on Acoustics, Speech and Signal Processing (IEEE, 2010), pp. 4214–4217. https://doi.org/10.1109/ICASSP.2010.5495701

  29. Y. Tang, Background adaptation for improved listening experience in broadcasting, in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2019), pp. 8008–8012

  30. H. Valbret, E. Moulines, J.-P. Tubach, Voice transformation using PSOLA technique. Speech Commun. 11(2–3), 175–187 (1992)

  31. E. Vincent et al., The second ‘chime’ speech separation and recognition challenge: datasets, tasks and baselines, in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013), pp. 126–130. https://doi.org/10.1109/ICASSP.2013.6637622

  32. D. Wang, On ideal binary mask as the computational goal of auditory scene analysis, in Speech Separation by Humans and Machines (Springer, 2005), pp. 181–197

  33. B. Xia, C. Bao, Wiener filtering based speech enhancement with weighted denoising auto-encoder and noise classification. Speech Commun. 60, 13–29 (2014). https://doi.org/10.1016/j.specom.2014.02.001

  34. K. Yamamoto et al., Predicting speech intelligibility using a Gammachirp envelope distortion index based on the signal-to-distortion ratio, in INTERSPEECH (2017), pp. 2949–2953

  35. S. Zahorian, H. Hu, A spectral/temporal method for robust fundamental frequency tracking. J. Acoust. Soc. Am. 123, 4559–4571 (2008). https://doi.org/10.1121/1.2916590

  36. A. Zehtabian et al., A novel speech enhancement approach based on singular value decomposition and genetic algorithm, in IEEE International Conference of Soft Computing and Pattern Recognition (IEEE, 2010), pp. 430–435. https://doi.org/10.1109/SOCPAR.2010.5686627

Funding

This work is supported through Project No. CRG/2018/003920 under the CRG-SERB Scheme.

Author information

Contributions

The authors are the sole contributors to this work. The simulations were carried out in MATLAB, and the pertinent results are presented in this work. The work was supervised by Dr Karan Nathwani.

Corresponding author

Correspondence to Ritujoy Biswas.

Ethics declarations

Conflict of interest

The author(s) declare that there is no conflict of interest.

Code availability

Custom code.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Biswas, R., Nathwani, K. Optimal Near-End Speech Intelligibility Improvement Using CLPSO-Based Voice Transformation in Realistic Noisy Environments. Circuits Syst Signal Process 41, 6999–7034 (2022). https://doi.org/10.1007/s00034-022-02106-3

  • DOI: https://doi.org/10.1007/s00034-022-02106-3
