FTDCN: Full Two-Dimensional Convolution Network for Speech Enhancement in Time-Frequency Domain

Liu, Maoqing; Liu, Hongqing; Zhou, Yi; Gan, Lu

doi:10.1007/978-3-031-34790-0_8

Part of the book series: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (LNICST, volume 500)

Included in the following conference series:

International Conference on Communications and Networking in China

Abstract

The dual-path structure achieves superior performance in monaural speech enhancement (SE), demonstrating the importance of modeling the long-range spectral patterns of a single frame. In this paper, two novel causal temporal convolutional network (TCN) modules are proposed: the intra-frame complex-valued two-dimensional TCN (Intra-CTTCN), which captures the long-range spectral dependence within a single frame, and the inter-frame complex-valued two-dimensional TCN (Inter-CTTCN), which captures the long-term dependence between frames. These two lightweight TCN components, which are composed entirely of two-dimensional convolutions, maintain a high-dimensional feature representation that facilitates the distinction between speech and noise. We combine the Inter-CTTCN and Intra-CTTCN with a gated complex-valued convolutional encoder and decoder structure to design a full two-dimensional convolutional network (FTDCN) for SE in the time-frequency (T-F) domain. With noisy speech as input, the proposed model was evaluated experimentally on the datasets of the Interspeech 2020 Deep Noise Suppression Challenge (DNS Challenge 2020). The NB-PESQ of the proposed model exceeds that of the DNS Challenge 2020 first-place model by 0.19, and the model requires only 0.8 M parameters.
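To make the idea of a causal, complex-valued convolution along the frame axis concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the kernel taps, and the use of dilation (a common TCN practice, not stated in the abstract) are all assumptions chosen for illustration. The spectrogram is modeled as a list of frames, each a list of complex STFT bins, and each output frame depends only on the current and past frames.

```python
# Illustrative sketch only; names, kernel values and dilation are assumptions,
# not details taken from the paper.

def causal_dilated_conv_time(spec, kernel, dilation):
    """Convolve along the time (frame) axis using only current and past frames.

    spec:     list of T frames, each a list of F complex bins
    kernel:   list of K complex taps; kernel[0] weights the current frame
    dilation: spacing, in frames, between consecutive taps
    """
    num_frames, num_bins = len(spec), len(spec[0])
    out = [[0j] * num_bins for _ in range(num_frames)]
    for t in range(num_frames):
        for f in range(num_bins):
            acc = 0j
            for k, w in enumerate(kernel):
                past = t - k * dilation
                if past >= 0:  # frames before t = 0 act as zero padding
                    acc += w * spec[past][f]  # complex multiply-accumulate
            out[t][f] = acc
    return out


def receptive_field(kernel_size, dilations):
    """Number of frames seen by a stack of causal dilated layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Toy input: 6 frames, 3 frequency bins, value = frame index + j * bin index.
spec = [[complex(t, f) for f in range(3)] for t in range(6)]
kernel = [0.5 + 0.5j, 0.25j]  # hypothetical taps
y = causal_dilated_conv_time(spec, kernel, dilation=2)

print(receptive_field(3, [1, 2, 4, 8]))  # 31
```

Applying the same loop along the frequency index instead of the frame index would give the intra-frame counterpart; there causality is not required, since all bins of a frame are available at once.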

Author information

Authors and Affiliations

School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing, China
Maoqing Liu, Hongqing Liu & Yi Zhou
College of Engineering, Design and Physical Science, Brunel University, London, UB8 3PH, UK
Lu Gan

Corresponding author

Correspondence to Maoqing Liu.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Feifei Gao
Fudan University, Shanghai, China
Jun Wu
Chongqing University, Chongqing, China
Yun Li
Shanghai University, Shanghai, China
Honghao Gao

About this paper

Cite this paper

Liu, M., Liu, H., Zhou, Y., Gan, L. (2023). FTDCN: Full Two-Dimensional Convolution Network for Speech Enhancement in Time-Frequency Domain. In: Gao, F., Wu, J., Li, Y., Gao, H. (eds) Communications and Networking. ChinaCom 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 500. Springer, Cham. https://doi.org/10.1007/978-3-031-34790-0_8

DOI: https://doi.org/10.1007/978-3-031-34790-0_8
Published: 10 June 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-34789-4
Online ISBN: 978-3-031-34790-0
eBook Packages: Computer Science, Computer Science (R0)
