
MAF-Net: multidimensional attention fusion network for multichannel speech separation

Regular Paper · Multimedia Systems

Abstract

Recent studies have shown that multichannel narrow-band speech separation achieves remarkable performance, whereas most successful deep learning-based methods work directly on the full-band spectrum of speech. Motivated by these two different but complementary trends, this paper proposes a multidimensional attention fusion network (MAF-Net) that automatically exploits and fuses narrow-band and full-band speech separation information, aiming to improve multichannel speech separation in reverberant environments. Specifically, it extracts effective narrow-band and full-band information along three dimensions (temporal, spatial, and channel) and dynamically integrates them through a feature fusion mechanism. First, the narrow-band feature extractor (NBFE) collects temporal information frame by frame to model context dependency; it takes the multichannel mixture signals of a single frequency as input and generates a narrow-band feature map. Then, the multidimensional attention fusion module (MAFM) is proposed to adaptively fuse narrow-band and full-band information along the spatial and channel dimensions. Finally, two arrangements of the attention modules within the MAFM, the parallel MAFM (P-MAFM) and the sequential MAFM (S-MAFM), are designed to maximize the utilization of spatial and channel attention features. Experimental results demonstrate that the proposed method outperforms other advanced methods.
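To make the fusion idea concrete, the following minimal PyTorch sketch illustrates how narrow-band and full-band feature maps could be combined with channel and spatial attention in either a parallel (P-MAFM-like) or sequential (S-MAFM-like) arrangement. The module structure, tensor shapes, and hyperparameters are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of attention-based narrow-band / full-band fusion.
# All names, shapes, and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (assumed form)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # pool over time x frequency
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                                  # x: (B, C, T, F)
        return x * self.gate(x)                            # reweight feature channels


class SpatialAttention(nn.Module):
    """CBAM-style attention over the time-frequency plane (assumed form)."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                  # x: (B, C, T, F)
        avg_map = x.mean(dim=1, keepdim=True)              # (B, 1, T, F)
        max_map, _ = x.max(dim=1, keepdim=True)            # (B, 1, T, F)
        mask = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * mask                                    # reweight T-F positions


class MAFMSketch(nn.Module):
    """Fuses narrow-band and full-band features with channel/spatial attention.

    arrangement='parallel'   -> attentions applied side by side (P-MAFM-like)
    arrangement='sequential' -> channel attention then spatial attention (S-MAFM-like)
    """
    def __init__(self, channels: int, arrangement: str = "parallel"):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()
        self.arrangement = arrangement

    def forward(self, narrow_band, full_band):             # both: (B, C, T, F)
        fused = narrow_band + full_band                    # simple additive fusion (assumption)
        if self.arrangement == "parallel":
            out = self.ca(fused) + self.sa(fused)
        else:
            out = self.sa(self.ca(fused))
        return out + fused                                 # residual connection


if __name__ == "__main__":
    nb = torch.randn(2, 16, 100, 129)                      # (batch, channels, frames, frequencies)
    fb = torch.randn(2, 16, 100, 129)
    print(MAFMSketch(16, "parallel")(nb, fb).shape)        # torch.Size([2, 16, 100, 129])
```

The parallel variant lets the two attention branches see the same fused input, while the sequential variant lets spatial attention refine an already channel-reweighted map; which ordering works better is an empirical question the paper studies.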


Data availability

The experiments in this paper are based on the publicly available WSJ0-2mix dataset; therefore, no separate data availability statement is provided.


Funding

This work was supported by the National Natural Science Foundation of China under Grant 61571279.

Author information


Contributions

Conceptualization, HL and QH; methodology, HL and QH; software, HL; validation, HL and QH; formal analysis, QH; investigation, HL; resources, QH; data curation, HL; writing—original draft preparation, HL; writing—review and editing, HL and QH; visualization, HL; supervision, QH; project administration, QH; funding acquisition, QH. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Qinghua Huang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Additional information

Communicated by J. Gao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, H., Huang, Q. MAF-Net: multidimensional attention fusion network for multichannel speech separation. Multimedia Systems 29, 3703–3720 (2023). https://doi.org/10.1007/s00530-023-01155-1
