Abstract
Deep learning methods have been applied to music source separation for several years and have proved highly effective. Most of them use the Fourier transform as a front end to obtain a spectrogram representation, which has a drawback: the spectrogram may suit human understanding of sound, but it is not necessarily the best representation for a powerful neural network performing singing voice separation. TasNet (Time-domain Audio Separation Network) was recently proposed to perform monaural speech separation directly in the time domain by modeling each source as a weighted sum of a common set of basis signals, and the subsequent fully-convolutional TasNet (Conv-TasNet) achieved substantial improvements in speech separation. In this paper, we first show that convolutional TasNet can also be applied to singing voice separation and improves results on the DSD100 dataset. Then, observing that in singing voice separation the difference between the singing voice and the accompaniment is far more pronounced than the difference between two speakers' voices in speech separation, we employ separate sets of basis signals and separate encoder outputs for the singing voice and the accompaniment respectively, yielding a further improved model, the distinct synthesizer convolutional TasNet (ds-cTasNet).
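The core idea the abstract describes can be illustrated with a minimal numerical sketch: a learned encoder turns overlapping waveform frames into nonnegative basis weights, a separator produces one mask per source, and each source is reconstructed as a weighted sum of basis signals. The sketch below is illustrative only, not the authors' implementation: the basis matrices `U`, `B_vocal`, `B_accomp` and the random softmax masks are hypothetical stand-ins (a real Conv-TasNet learns them, and its separator is a temporal convolutional network). The two distinct decoder bases mimic the ds-cTasNet idea of giving voice and accompaniment their own basis sets instead of one shared set.

```python
import numpy as np

def frame_signal(x, win, hop):
    """Split a 1-D waveform into overlapping frames (the encoder front end)."""
    n = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop : i * hop + win] for i in range(n)])

def overlap_add(frames, hop, length):
    """Reconstruct a waveform from frames by overlap-add (the decoder back end)."""
    out = np.zeros(length)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + len(f)] += f
    return out

rng = np.random.default_rng(0)
win, hop, n_basis = 16, 8, 32
mixture = rng.standard_normal(256)          # toy mono mixture waveform

# Shared encoder: each frame -> nonnegative weights over n_basis basis signals.
U = rng.standard_normal((n_basis, win))     # hypothetical learned encoder basis
frames = frame_signal(mixture, win, hop)    # (n_frames, win)
weights = np.maximum(frames @ U.T, 0.0)     # ReLU, (n_frames, n_basis)

# Stand-in separator: a softmax over sources gives masks that sum to 1 per bin.
logits = rng.standard_normal((2, *weights.shape))
masks = np.exp(logits) / np.exp(logits).sum(axis=0)

# ds-cTasNet idea: a *separate* decoder basis per source instead of one shared
# basis (the full model also uses separate encoder outputs per source).
B_vocal = rng.standard_normal((n_basis, win))
B_accomp = rng.standard_normal((n_basis, win))

vocal = overlap_add((masks[0] * weights) @ B_vocal, hop, len(mixture))
accomp = overlap_add((masks[1] * weights) @ B_accomp, hop, len(mixture))
```

Because the decoders are distinct, the model can devote one basis set to the harmonic, pitched structure of singing and the other to the broadband accompaniment, rather than forcing both sources through a single shared synthesizer.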
References
Hsu, C., Jang, J.R.: On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset. IEEE Trans. Audio Speech Lang. Process. 18(2), 310–319 (2010). https://doi.org/10.1109/TASL.2009.2026503
Huang, P., Kim, M., Hasegawa-Johnson, M., Smaragdis, P.: Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Trans. Audio Speech Lang. Process. 23(12), 2136–2147 (2015). https://doi.org/10.1109/TASLP.2015.2468583
Isik, Y., Roux, J.L., Chen, Z., Watanabe, S., Hershey, J.R.: Single-channel multi-speaker separation using deep clustering. In: Interspeech 2016, 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, 8–12 September 2016, pp. 545–549 (2016). https://doi.org/10.21437/Interspeech.2016-1176
Jansson, A., Humphrey, E.J., Montecchio, N., Bittner, R.M., Kumar, A., Weyde, T.: Singing voice separation with deep U-Net convolutional networks. In: Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, Suzhou, China, 23–27 October 2017, pp. 745–751 (2017). https://ismir2017.smcnus.org/wp-content/uploads/2017/10/171_Paper.pdf
Kolbæk, M., Yu, D., Tan, Z., Jensen, J.: Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 25(10), 1901–1913 (2017). https://doi.org/10.1109/TASLP.2017.2726762
Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1003–1012, July 2017. https://doi.org/10.1109/CVPR.2017.113
Luo, Y., Chen, Z., Hershey, J.R., Le Roux, J., Mesgarani, N.: Deep clustering and conventional networks for music separation: stronger together. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 61–65, March 2017. https://doi.org/10.1109/ICASSP.2017.7952118
Luo, Y., Mesgarani, N.: TasNet: time-domain audio separation network for real-time, single-channel speech separation. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 696–700, April 2018. https://doi.org/10.1109/ICASSP.2018.8462116
Luo, Y., Mesgarani, N.: Conv-TasNet: surpassing ideal time-frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 27(8), 1256–1266 (2019). https://doi.org/10.1109/TASLP.2019.2915167
Nugraha, A.A., Liutkus, A., Vincent, E.: Multichannel audio source separation with deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 24(9), 1652–1664 (2016). https://doi.org/10.1109/TASLP.2016.2580946
Simpson, A.J.R., Roma, G., Plumbley, M.D.: Deep Karaoke: extracting vocals from musical mixtures using a convolutional deep neural network. In: Vincent, E., Yeredor, A., Koldovský, Z., Tichavský, P. (eds.) LVA/ICA 2015. LNCS, vol. 9237, pp. 429–436. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-22482-4_50
Venkataramani, S., Casebeer, J., Smaragdis, P.: End-to-end source separation with adaptive front-ends. In: 2018 52nd Asilomar Conference on Signals, Systems, and Computers, pp. 684–688, October 2018. https://doi.org/10.1109/ACSSC.2018.8645535
Vincent, E., Gribonval, R., Fevotte, C.: Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process. 14(4), 1462–1469 (2006). https://doi.org/10.1109/TSA.2005.858005
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Tian, C., Yang, D., Chen, X. (2020). A Distinct Synthesizer Convolutional TasNet for Singing Voice Separation. In: Ro, Y., et al. (eds.) MultiMedia Modeling. MMM 2020. Lecture Notes in Computer Science, vol. 11961. Springer, Cham. https://doi.org/10.1007/978-3-030-37731-1_4
DOI: https://doi.org/10.1007/978-3-030-37731-1_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-37730-4
Online ISBN: 978-3-030-37731-1
eBook Packages: Computer Science (R0)