Abstract
Speech enhancement is a key component in voice communication technology, as it serves as an important pre-processing step for systems such as acoustic echo cancellation, speech separation, and speech conversion. A low-latency speech enhancement algorithm is desirable, since long latency delays the entire system's response. In STFT-based systems, reducing algorithmic latency by using smaller STFT window sizes leads to significant degradation in speech quality. By introducing a simple additional compensation window alongside the original short main window in the STFT analysis step, we preserve signal quality comparable to that of the original high-latency system while reducing the algorithmic latency from 42 ms to 5 ms. Experiments on the full-band VCD dataset and a large full-band Microsoft internal dataset show the effectiveness of the proposed method.
Work performed while Minh N. Bui was a research intern at Microsoft.
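The abstract's core idea can be illustrated numerically. In overlap-add STFT processing, algorithmic latency is roughly the synthesis window length, which is why shrinking the window from ~42 ms to ~5 ms cuts latency so sharply. The sketch below, under assumed window sizes at a 48 kHz full-band sampling rate, shows a dual-window analysis in the spirit the abstract describes: a short main window bounds latency while a longer compensation window over the same past samples restores spectral context. The function names, window choices (Hann), and FFT sizes here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

SR = 48_000  # full-band sampling rate (assumption)

def algorithmic_latency_ms(win_len, sr=SR):
    # In overlap-add STFT processing, a sample can only be emitted once
    # the frame containing it is complete, so algorithmic latency is
    # roughly the synthesis window length in milliseconds.
    return 1000.0 * win_len / sr

# Hypothetical window sizes matching the latencies quoted in the abstract.
LONG_WIN = int(0.042 * SR)   # ~42 ms -> 2016 samples
SHORT_WIN = int(0.005 * SR)  # ~5 ms  -> 240 samples

def dual_window_analysis(past_samples, main_len=SHORT_WIN, n_fft=None):
    """Analyze one hop with a short main window plus a longer
    compensation window over the same buffered past samples.
    A sketch of the dual-window idea; the paper's exact windows differ.
    """
    comp_len = len(past_samples)
    n_fft = n_fft or comp_len
    # Short main window: covers only the most recent samples,
    # so it determines the algorithmic latency.
    main = past_samples[-main_len:] * np.hanning(main_len)
    # Long compensation window: reuses already-available past samples,
    # adding spectral resolution without adding latency.
    comp = past_samples * np.hanning(comp_len)
    X_main = np.fft.rfft(main, n_fft)
    X_comp = np.fft.rfft(comp, n_fft)
    return X_main, X_comp
```

Because the compensation window only looks backward over samples the system has already received, it improves the analysis features without waiting for any future audio, which is how quality can be retained at the short-window latency.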
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Bui, M.N., Tran, D.N., Koishida, K., Tran, T.D., Chin, P. (2024). Improving Low-Latency Mono-Channel Speech Enhancement by Compensation Windows in STFT Analysis. In: Cherifi, H., Rocha, L.M., Cherifi, C., Donduran, M. (eds) Complex Networks & Their Applications XII. COMPLEX NETWORKS 2023. Studies in Computational Intelligence, vol 1141. Springer, Cham. https://doi.org/10.1007/978-3-031-53468-3_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53467-6
Online ISBN: 978-3-031-53468-3
eBook Packages: Engineering, Engineering (R0)