Abstract
Speech separation techniques have changed rapidly in the last few years. Traditional recurrent neural networks (RNNs) have steadily been replaced by other architectures, such as convolutional neural networks (CNNs). Although these models have greatly improved both speed and accuracy, they inevitably sacrifice some long-term dependency modeling. As a result, the separated signals are prone to being assigned to the wrong speaker. This problem is even more common when the mixed speech is sparse, as in everyday communication. In this paper, a two-stage training recipe with a restriction term based on the scale-invariant signal-to-noise ratio (SISNR) is put forward to prevent the wrong-assignment problem on sparsely mixed speech. The experiments are conducted on mixtures built from the Japanese Newspaper Article Sentences (JNAS) corpus. According to the experiments, the proposed approach works efficiently on sparse data (overlap rate around 50%), and separation performance improves accordingly. To assess the applicability of speech separation in practical scenarios, such as meeting transcription, the separated signals are also evaluated by speech recognition. The results show that the character error rate is reduced by 10% compared to the baseline.
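For reference, the SISNR measure underlying the proposed restriction term can be sketched as follows. This is a minimal pure-Python sketch of the commonly used SI-SNR definition (zero-mean the signals, project the estimate onto the reference, compare target and residual energies), not the authors' exact restriction term; the `eps` numerical guard is an assumption.

```python
import math

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio (SI-SNR) in dB.

    est, ref: equal-length sample sequences (estimate and reference).
    """
    # Zero-mean both signals so the measure ignores DC offsets.
    me = sum(est) / len(est)
    mr = sum(ref) / len(ref)
    e = [x - me for x in est]
    r = [x - mr for x in ref]
    # Project the estimate onto the reference:
    # s_target = (<e, r> / ||r||^2) * r, so rescaling `est` has no effect.
    dot = sum(a * b for a, b in zip(e, r))
    ref_energy = sum(a * a for a in r) + eps
    s_target = [dot / ref_energy * a for a in r]
    # The residual is whatever part of the estimate the projection misses.
    noise = [a - b for a, b in zip(e, s_target)]
    target_energy = sum(a * a for a in s_target)
    noise_energy = sum(a * a for a in noise) + eps
    return 10 * math.log10(target_energy / noise_energy + eps)
```

Because of the projection step, a perfectly separated but rescaled output still scores very high, while a signal contaminated by the other speaker scores low; a training restriction built on this quantity therefore penalizes wrongly assigned content rather than mere gain mismatch.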
Acknowledgments
We used “ASJ Japanese Newspaper Article Sentences Read Speech Corpus” provided by Speech Resources Consortium, National Institute of Informatics.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Dang, S., Matsumoto, T., Kudo, H., Takeuchi, Y. (2021). A Restriction Training Recipe for Speech Separation on Sparsely Mixed Speech. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Communications in Computer and Information Science, vol 1517. Springer, Cham. https://doi.org/10.1007/978-3-030-92310-5_85
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-92309-9
Online ISBN: 978-3-030-92310-5
eBook Packages: Computer Science (R0)