Multi-stage strength estimation network with cross attention for single channel speech enhancement

  • Original Paper
  • Published:
Signal, Image and Video Processing

Abstract

Speech enhancement is a fundamental task in acoustic signal processing that remains an open challenge. Recently, with the rapid development of deep learning, data-driven approaches built from a variety of machine-learning modules have made great progress in speech enhancement. Each of these basic modules has unique advantages as well as certain limitations. Inspired by the modules' distinct preferences and the distinguishing features of speech signals, we propose a multi-stage strength estimation network with cross-attention for single-channel speech enhancement. The proposed method consists of a feature-wise fusion block based on the attention mechanism and a strength estimation block using FFT and sequential representations (FTB). We first formulate the speech enhancement problem mathematically, and then compare the proposed method with several well-known speech enhancement methods on the 50-h DNS and LibriFSD50K datasets, showing that it attends fully to both the time and frequency domains and achieves satisfying results. Ablation studies further confirm the effectiveness of each component of the proposed method. These results demonstrate that combining modules with different properties can improve the performance of speech enhancement models, pointing to a promising direction for future development.
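The feature-wise fusion described in the abstract relies on a cross-attention mechanism, in which one feature stream queries another. The paper's exact block design is not reproduced here; the following is a minimal sketch of generic scaled dot-product cross-attention in NumPy, with the time-branch and frequency-branch shapes chosen purely for illustration (one stream of 100 time steps attending to 257 frequency bins, both with hypothetical 64-dimensional features).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, key_value_feats):
    """Scaled dot-product cross-attention between two feature streams.

    query_feats:     (num_queries, dim)  - the stream doing the attending
    key_value_feats: (num_keys, dim)     - the stream being attended to
    Returns a fused representation of shape (num_queries, dim).
    """
    d = query_feats.shape[-1]
    scores = query_feats @ key_value_feats.T / np.sqrt(d)  # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)                     # rows sum to 1
    return weights @ key_value_feats

rng = np.random.default_rng(0)
time_feats = rng.standard_normal((100, 64))   # hypothetical time-branch features
freq_feats = rng.standard_normal((257, 64))   # hypothetical frequency-branch features
fused = cross_attention(time_feats, freq_feats)
print(fused.shape)  # (100, 64)
```

In a real enhancement network the queries, keys, and values would pass through learned linear projections before the dot product; they are omitted here to keep the mechanism itself visible.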


Data availability statements

All datasets used in our research are publicly available on the internet.

Notes

  1. https://github.com/microsoft/DNS-Challenge/tree/master

  2. https://github.com/etzinis/fedenhance

  3. https://github.com/microsoft/DNS-Challenge/tree/master/DNSMOS


Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 52275296), National Key Research and Development Program of China (No. 2022YFC2402700), the China Scholarships Council (No. 202306420024), the Graduate Innovation Program of China University of Mining and Technology (No. 2024WLJCRCZL112), the Postgraduate Research & Practice Innovation Program of Jiangsu Province (No. KYCX24_2729), and the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Author information

Authors and Affiliations

Authors

Contributions

ZZ: Data Curation, Methodology, Conceptualization, Writing—Original Draft. YD: Software, Writing—Original Draft, Investigation. WC: Visualization, Project administration. YC: Writing—Original Draft, Investigation. WG: Project administration, Supervision. HL: Supervision, Funding acquisition. All authors co-wrote the manuscript and approved the final version for submission.

Corresponding author

Correspondence to Houguang Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, Z., Ding, Y., Chen, W. et al. Multi-stage strength estimation network with cross attention for single channel speech enhancement. SIViP 18, 6937–6948 (2024). https://doi.org/10.1007/s11760-024-03364-1
