Abstract
This work aims to improve the near-end intelligibility of speech at very low signal-to-noise ratios (SNRs). Unlike existing intelligibility improvement methods, the proposed approach does not require prior knowledge of the noise statistics. To this end, the shaping parameters of a voice transformation function (VTF) are optimized within a comprehensive learning particle swarm optimization (CLPSO) framework. Optimizing these parameters corresponds to a combined modification comprising formant shifting, nonuniform time scaling, smoothing, and energy redistribution. The optimal parameters are obtained with a CLPSO cost function that jointly maximizes the short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ), and signal-to-distortion ratio (SDR) metrics. The resulting intelligibility improvement is significantly higher than that obtained by applying each of these modifications individually, while speech quality is preserved. In addition, Gaussian process regression is employed to estimate the shaping parameters of the VTF at arbitrary SNRs other than those used during CLPSO training.
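To make the optimization step in the abstract more concrete, the following is a minimal Python sketch of a simplified CLPSO loop in the spirit of Liang et al. (2006). It is not the authors' MATLAB implementation. The function names `clpso` and `toy_cost`, the four-dimensional parameter vector, the bounds, and all hyperparameter values are assumptions chosen for illustration. In particular, `toy_cost` is a quadratic stand-in: in the actual method the cost would be built from the STOI, PESQ, and SDR of the transformed speech, which is not reproduced here.

```python
import numpy as np

def clpso(cost, lower, upper, n_particles=20, n_iter=200,
          refresh_gap=5, c=1.49445, seed=0):
    """Minimise `cost` over a box using a simplified CLPSO (illustrative)."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    dim = lower.size
    span = upper - lower
    v_max = 0.2 * span                      # assumed velocity limit

    x = lower + rng.random((n_particles, dim)) * span
    v = (rng.random((n_particles, dim)) - 0.5) * 2.0 * v_max
    pbest = x.copy()
    pbest_cost = np.array([cost(p) for p in x])

    # Per-particle learning probability, as proposed by Liang et al. (2006).
    idx = np.arange(n_particles)
    pc = 0.05 + 0.45 * (np.exp(10 * idx / (n_particles - 1)) - 1) / (np.exp(10) - 1)

    def pick_exemplars(i):
        # For each dimension, learn from own pbest or from the winner of a
        # two-particle tournament among the other particles' pbests.
        ex = np.full(dim, i)
        others = np.delete(idx, i)
        for d in range(dim):
            if rng.random() < pc[i]:
                a, b = rng.choice(others, size=2, replace=False)
                ex[d] = a if pbest_cost[a] < pbest_cost[b] else b
        if np.all(ex == i):                 # force at least one foreign dimension
            ex[rng.integers(dim)] = rng.choice(others)
        return ex

    exemplars = np.array([pick_exemplars(i) for i in range(n_particles)])
    stall = np.zeros(n_particles, dtype=int)

    for t in range(n_iter):
        w = 0.9 - 0.5 * t / max(n_iter - 1, 1)   # linearly decreasing inertia
        for i in range(n_particles):
            if stall[i] >= refresh_gap:          # re-draw exemplars when stuck
                exemplars[i] = pick_exemplars(i)
                stall[i] = 0
            target = pbest[exemplars[i], np.arange(dim)]
            v[i] = w * v[i] + c * rng.random(dim) * (target - x[i])
            v[i] = np.clip(v[i], -v_max, v_max)
            x[i] = x[i] + v[i]
            # Only particles inside the search box are evaluated.
            if np.all((x[i] >= lower) & (x[i] <= upper)):
                f = cost(x[i])
                if f < pbest_cost[i]:
                    pbest[i], pbest_cost[i] = x[i].copy(), f
                    stall[i] = 0
                    continue
            stall[i] += 1

    best = int(np.argmin(pbest_cost))
    return pbest[best], pbest_cost[best]

# Hypothetical stand-in for the negated joint STOI/PESQ/SDR objective.
hypothetical_optimum = np.array([0.3, -0.2, 0.5, 0.1])

def toy_cost(params):
    return float(np.sum((params - hypothetical_optimum) ** 2))

lower = np.full(4, -1.0)
upper = np.full(4, 1.0)
best, best_cost = clpso(toy_cost, lower, upper)
print(best, best_cost)
```

Because each dimension of a particle can follow a different particle's personal best, the swarm keeps more diversity than standard PSO, which is the property that makes CLPSO attractive for multimodal cost surfaces. Replacing `toy_cost` with a function that applies the VTF and scores the result would be the natural next step, under whatever weighting of the three metrics the full paper specifies.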
Availability of Data and Materials
The dataset used in the current work is the CHAINS (CHAracterizing INdividual Speakers) dataset [2].
References
G. Biagetti et al., Speaker identification in noisy conditions using short sequences of speech frames, in Intelligent Decision Technologies 2017 (2018), pp. 43–52. ISBN: 978-3-319-59423-1. https://doi.org/10.1007/978-3-319-59424-8_5
F. Cummins et al., The CHAINS corpus: characterizing individual speakers, in SPECOM, vol. 6, SPC RAS. (2006), pp. 431–435
Y. Ephraim, D. Malah, Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 32(6), 1109–1121 (1984). https://doi.org/10.1109/TASSP.1984.1164453
F. Farias, R. Coelho, Blind adaptive mask to improve intelligibility of non-stationary noisy speech. IEEE Signal Process. Lett. 28, 1170–1174 (2021). https://doi.org/10.1109/LSP.2021.3086405
E. Fonseca et al., Freesound datasets: a platform for the creation of open audio datasets, in Proceedings of the 18th ISMIR Conference, Suzhou, China (2017), pp. 486–493
J.D. Griffiths, Optimum linear filter for speech transmission. J. Acoust. Soc. Am. 43(1), 81–86 (1968). https://doi.org/10.1121/1.1910768
R. Hendriks et al., Optimal near-end speech intelligibility improvement incorporating additive noise and late reverberation under an approximation of the short-time SII. Trans. Audio Speech Lang. Process. 23, 851–862 (2015). https://doi.org/10.1109/TASLP.2015.2409780
Y. Hu, P. Loizou, Subjective comparison of speech enhancement algorithms, in 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, vol. 1 (2006), pp. I–I. https://doi.org/10.1109/ICASSP.2006.1659980
Y. Jiang, H. Zhou, Z. Feng, Performance analysis of ideal binary masks in speech enhancement, in 4th International Congress on Image and Signal Processing, vol. 5 (IEEE, 2011), pp. 2422–2425. https://doi.org/10.1109/CISP.2011.6100732
J.J. Liang et al., Comprehensive learning particle swarm optimizer for global optimization of multimodal functions. IEEE Trans. Evol. Comput. 10(3), 281–295 (2006). https://doi.org/10.1109/TEVC.2005.857610
R. Martin, Spectral subtraction based on minimum statistics. Power 6(8), 1182–1185 (1994)
N. McLaughlin, J. Ming, D. Crookes, Speaker recognition in noisy conditions with limited training data, in 2011 19th European Signal Processing Conference (2011), pp. 1294–1298
J. Ming et al., Robust speaker recognition in noisy conditions. IEEE Trans. Audio Speech Lang. Process. (2007). https://doi.org/10.1109/TASL.2007.899278
K. Nathwani, Intelligibility improvement using Kalman filtering & EM approach in formant shifting framework, in International Symposium on Signal Processing and Information Technology (ISSPIT) (IEEE, 2019), pp. 1–6. https://doi.org/10.1109/ISSPIT47144.2019.9001849
K. Nathwani, P. Pandit, R.M. Hegde, Group delay based methods for speaker segregation and its application in multimedia information retrieval. IEEE Trans. Multimed. 15(6), 1326–1339 (2013)
K. Nathwani et al., Formant shifting for speech intelligibility improvement in car noise environment, in International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2016), pp. 5375–5379. https://doi.org/10.1109/ICASSP.2016.7472704
K. Nathwani et al., Speech intelligibility improvement in car noise environment by voice transformation. Speech Commun. 91, 17–27 (2017). https://doi.org/10.1016/j.specom.2017.04.007
R.J. Niederjohn, J.H. Grotelueschen, The enhancement of speech intelligibility in high noise levels by high-pass filtering followed by rapid amplitude compression. IEEE Trans. Acoust. Speech Signal Process. 24(4), 277–282 (1976). https://doi.org/10.1109/TASSP.1976.1162824
R. Patel et al., Nonlinear excitation control of diesel generator: a command filter backstepping approach. Trans. Ind. Inform. (2020). https://doi.org/10.1109/TII.2020.3017744
L. Rabiner, R. Schafer, Theory and Applications of Digital Speech Processing (Prentice Hall Press, Hoboken, 2010)
M. Rahmati, R. Effatnejad, A. Safari, Comprehensive learning particle swarm optimization (CLPSO) for multi-objective optimal power flow. Indian J. Sci. Technol. 7(3), 262–270 (2014). https://doi.org/10.17485/ijst/2014/v7i3.7
C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning (The MIT Press, Cambridge, 2006)
A.W. Rix et al., Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2001), pp. 749–752
M. Song et al., A time-weighted method for predicting the intelligibility of speech in the presence of interfering sounds, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018), pp. 5589–5593. https://doi.org/10.1109/ICASSP.2018.8462124
T. Sreenivas, P. Kirnapure, Codebook constrained Wiener filtering for speech enhancement. IEEE Trans. Speech Audio Process. 4(5), 383–389 (1996). https://doi.org/10.1109/89.536932
C. Taal, R. Hendriks, H. Richard, Speech energy redistribution for intelligibility improvement in noise based on a perceptual distortion measure. Comput. Speech Lang. 28(4), 858–872 (2014). https://doi.org/10.1016/j.csl.2013.11.003
C. Taal, J. Jensen, SII-based speech preprocessing for intelligibility improvement in noise, in Annual Conference of the International Speech Communication Association. INTERSPEECH (2013), pp. 3582–3586
C.H. Taal et al., A short-time objective intelligibility measure for time-frequency weighted noisy speech, in International Conference on Acoustics, Speech and Signal Processing (IEEE, 2010), pp. 4214–4217. https://doi.org/10.1109/ICASSP.2010.5495701
Y. Tang, Background adaptation for improved listening experience in broadcasting, in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2019), pp. 8008–8012
H. Valbret, E. Moulines, J.-P. Tubach, Voice transformation using PSOLA technique. Speech Commun. 11(2–3), 175–187 (1992)
E. Vincent et al., The second ‘chime’ speech separation and recognition challenge: datasets, tasks and baselines, in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013), pp. 126–130. https://doi.org/10.1109/ICASSP.2013.6637622
D. Wang, On ideal binary mask as the computational goal of auditory scene analysis, in Speech Separation by Humans and Machines (Springer, 2005), pp. 181–197
B. Xia, C. Bao, Wiener filtering based speech enhancement with weighted denoising auto-encoder and noise classification. Speech Commun. 60, 13–29 (2014). https://doi.org/10.1016/j.specom.2014.02.001
K. Yamamoto et al., Predicting speech intelligibility using a Gammachirp envelope distortion index based on the signal-to-distortion ratio, in INTERSPEECH (2017), pp. 2949–2953
S. Zahorian, H. Hu, A spectral/temporal method for robust fundamental frequency tracking. J. Acoust. Soc. Am. 123, 4559–4571 (2008). https://doi.org/10.1121/1.2916590
A. Zehtabian et al., A novel speech enhancement approach based on singular value decomposition and genetic algorithm, in IEEE International Conference of Soft Computing and Pattern Recognition (IEEE, 2010), pp. 430–435. https://doi.org/10.1109/SOCPAR.2010.5686627
Funding
This work is supported through Project No. CRG/2018/003920 under the CRG-SERB Scheme.
Author information
Contributions
The authors are the sole contributors to this work. The simulations were carried out in MATLAB, and the pertinent results are presented in this work. This work was supervised by Dr Karan Nathwani.
Ethics declarations
Conflict of interests
The author(s) declare that there is no conflict of interest.
Code availability
Custom code.
About this article
Cite this article
Biswas, R., Nathwani, K. Optimal Near-End Speech Intelligibility Improvement Using CLPSO-Based Voice Transformation in Realistic Noisy Environments. Circuits Syst Signal Process 41, 6999–7034 (2022). https://doi.org/10.1007/s00034-022-02106-3
DOI: https://doi.org/10.1007/s00034-022-02106-3