
Multi-stage music separation network with dual-branch attention and hybrid convolution

Journal of Intelligent Information Systems

Abstract

In this paper, we propose a lightweight multi-stage network for monaural vocal and accompaniment separation. We design a dual-branch attention (DBA) module that captures the correlation between every pair of positions in a feature map in one branch and the correlation among its channels in the other. A square CNN (i.e., one with k × k filters) shares its weights over every square region of the feature maps, which limits its feature-extraction ability. To address this, we propose a hybrid convolution (HC) block, based on a hybrid convolutional mechanism, that replaces the square CNN and captures dependencies along the time dimension and the frequency dimension separately. Ablation experiments demonstrate that the DBA module and the HC block each help to improve separation performance. Experimental results show that our network achieves outstanding performance on the MIR-1K dataset with fewer parameters, and competitive performance compared with state-of-the-art methods on the DSD100 and MUSDB18 datasets.
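
To make the two components concrete, the sketch below gives a minimal PyTorch illustration of how a hybrid convolution block and a dual-branch attention module might look. This is an assumption-laden sketch, not the authors' implementation (see note 4 below for their code): the class names, the additive fusion of the two directional convolutions, and the gating used in each attention branch are all hypothetical, and the position branch here uses a simple position-wise gate rather than the full position-pair correlation described above.

```python
import torch
import torch.nn as nn


class HybridConvBlock(nn.Module):
    """Hypothetical sketch of a hybrid convolution (HC) block: replaces a
    k x k square convolution with parallel 1 x k and k x 1 convolutions that
    model dependencies along the time and frequency axes separately."""

    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        # Convolution along the time axis of a (batch, channel, freq, time) map
        self.time_conv = nn.Conv2d(channels, channels, kernel_size=(1, k),
                                   padding=(0, k // 2))
        # Convolution along the frequency axis
        self.freq_conv = nn.Conv2d(channels, channels, kernel_size=(k, 1),
                                   padding=(k // 2, 0))
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sum the two directional responses (one of several plausible fusions)
        return self.act(self.norm(self.time_conv(x) + self.freq_conv(x)))


class DualBranchAttention(nn.Module):
    """Hypothetical sketch of a dual-branch attention (DBA) module: one branch
    weights time-frequency positions, the other weights channels."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Position branch: a 1x1 convolution produces a position-wise gate
        # (a simplification of the position-pair correlation in the paper)
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
        # Channel branch: squeeze-and-excitation style channel gate
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Apply both gates and merge the two attended branches
        return x * self.spatial_gate(x) + x * self.channel_gate(x)


if __name__ == "__main__":
    x = torch.randn(2, 16, 64, 128)  # (batch, channel, freq, time) features
    y = DualBranchAttention(16)(HybridConvBlock(16)(x))
    print(y.shape)  # torch.Size([2, 16, 64, 128])
```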


Notes

  1. https://sites.google.com/site/unvoicedsoundseparation/mir-1k

  2. https://github.com/faroit/dsdtools

  3. https://github.com/sigsep/sigsep-mus-db

  4. A PyTorch implementation is available at: https://github.com/YadongChen-1016/Separation


Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC) (61761041, U1903213) and the Tianshan Innovation Team Plan Project of Xinjiang (202101642).

Author information

Corresponding author

Correspondence to Ying Hu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Data availability

The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Chen, Y., Hu, Y., He, L. et al. Multi-stage music separation network with dual-branch attention and hybrid convolution. J Intell Inf Syst 59, 635–656 (2022). https://doi.org/10.1007/s10844-022-00711-x
