Abstract
Speech separation techniques have changed rapidly in the last few years. Traditional recurrent neural networks (RNNs) have steadily been replaced by other architectures, such as convolutional neural networks (CNNs). Although these models have greatly improved both speed and accuracy, they inevitably sacrifice some long-term dependency modeling. As a result, the separated signals are prone to being assigned to the wrong speaker. This problem is even more common when the mixed speech is sparse, as in everyday communication. In this paper, a two-stage training recipe with a restriction term based on the scale-invariant signal-to-noise ratio (SISNR) is put forward to prevent the wrong-assignment problem on sparsely mixed speech. The experiments are conducted on mixtures built from the Japanese Newspaper Article Sentences (JNAS) corpus. According to the experiments, the proposed approach works efficiently on sparse data (overlap rate around 50%), and separation performance improves accordingly. To assess the applicability of speech separation in practical scenarios, such as meeting transcription, the separated signals are also evaluated by speech recognition. The results show that the character error rate is reduced by 10% compared to the baseline.
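For reference, the SISNR measure underlying the proposed restriction term can be sketched as follows. This is a minimal pure-Python sketch of the commonly used SI-SNR definition (zero-mean the signals, project the estimate onto the reference, compare target and residual energies), not the authors' exact restriction term; the `eps` numerical guard is an assumption.

```python
import math

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio (SI-SNR) in dB.

    est, ref: equal-length sample sequences (estimate and reference).
    """
    # Zero-mean both signals so the measure ignores DC offsets.
    me = sum(est) / len(est)
    mr = sum(ref) / len(ref)
    e = [x - me for x in est]
    r = [x - mr for x in ref]
    # Project the estimate onto the reference:
    # s_target = (<e, r> / ||r||^2) * r, so rescaling `est` has no effect.
    dot = sum(a * b for a, b in zip(e, r))
    ref_energy = sum(a * a for a in r) + eps
    s_target = [dot / ref_energy * a for a in r]
    # The residual is whatever part of the estimate the projection misses.
    noise = [a - b for a, b in zip(e, s_target)]
    target_energy = sum(a * a for a in s_target)
    noise_energy = sum(a * a for a in noise) + eps
    return 10 * math.log10(target_energy / noise_energy + eps)
```

Because of the projection step, a perfectly separated but rescaled output still scores very high, while a signal contaminated by the other speaker scores low; a training restriction built on this quantity therefore penalizes wrongly assigned content rather than mere gain mismatch.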
Acknowledgments
We used “ASJ Japanese Newspaper Article Sentences Read Speech Corpus” provided by Speech Resources Consortium, National Institute of Informatics.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Dang, S., Matsumoto, T., Kudo, H., Takeuchi, Y. (2021). A Restriction Training Recipe for Speech Separation on Sparsely Mixed Speech. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Communications in Computer and Information Science, vol 1517. Springer, Cham. https://doi.org/10.1007/978-3-030-92310-5_85
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-92309-9
Online ISBN: 978-3-030-92310-5
eBook Packages: Computer Science (R0)