Abstract
Speech enhancement is a key component in voice communication technology, as it serves as an important pre-processing step for systems such as acoustic echo cancellation, speech separation, and speech conversion. A low-latency speech enhancement algorithm is desirable, since long latency delays the entire system's response. In STFT-based systems, reducing algorithmic latency by using smaller STFT window sizes leads to significant degradation in speech quality. By introducing a simple additional compensation window alongside the original short main window in the STFT analysis step, we preserve signal quality comparable to that of the original high-latency system while reducing the algorithmic latency from 42 ms to 5 ms. Experiments on the full-band VCD dataset and a large full-band Microsoft internal dataset show the effectiveness of the proposed method.
Work performed while Minh N. Bui was a research intern at Microsoft.
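The abstract's core idea can be illustrated numerically. In overlap-add STFT processing, algorithmic latency is roughly the synthesis window length, which is why shrinking the window from ~42 ms to ~5 ms cuts latency so sharply. The sketch below, under assumed window sizes at a 48 kHz full-band sampling rate, shows a dual-window analysis in the spirit the abstract describes: a short main window bounds latency while a longer compensation window over the same past samples restores spectral context. The function names, window choices (Hann), and FFT sizes here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

SR = 48_000  # full-band sampling rate (assumption)

def algorithmic_latency_ms(win_len, sr=SR):
    # In overlap-add STFT processing, a sample can only be emitted once
    # the frame containing it is complete, so algorithmic latency is
    # roughly the synthesis window length in milliseconds.
    return 1000.0 * win_len / sr

# Hypothetical window sizes matching the latencies quoted in the abstract.
LONG_WIN = int(0.042 * SR)   # ~42 ms -> 2016 samples
SHORT_WIN = int(0.005 * SR)  # ~5 ms  -> 240 samples

def dual_window_analysis(past_samples, main_len=SHORT_WIN, n_fft=None):
    """Analyze one hop with a short main window plus a longer
    compensation window over the same buffered past samples.
    A sketch of the dual-window idea; the paper's exact windows differ.
    """
    comp_len = len(past_samples)
    n_fft = n_fft or comp_len
    # Short main window: covers only the most recent samples,
    # so it determines the algorithmic latency.
    main = past_samples[-main_len:] * np.hanning(main_len)
    # Long compensation window: reuses already-available past samples,
    # adding spectral resolution without adding latency.
    comp = past_samples * np.hanning(comp_len)
    X_main = np.fft.rfft(main, n_fft)
    X_comp = np.fft.rfft(comp, n_fft)
    return X_main, X_comp
```

Because the compensation window only looks backward over samples the system has already received, it improves the analysis features without waiting for any future audio, which is how quality can be retained at the short-window latency.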
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Bui, M.N., Tran, D.N., Koishida, K., Tran, T.D., Chin, P. (2024). Improving Low-Latency Mono-Channel Speech Enhancement by Compensation Windows in STFT Analysis. In: Cherifi, H., Rocha, L.M., Cherifi, C., Donduran, M. (eds) Complex Networks & Their Applications XII. COMPLEX NETWORKS 2023. Studies in Computational Intelligence, vol 1141. Springer, Cham. https://doi.org/10.1007/978-3-031-53468-3_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53467-6
Online ISBN: 978-3-031-53468-3
eBook Packages: Engineering, Engineering (R0)