
A Restriction Training Recipe for Speech Separation on Sparsely Mixed Speech

  • Conference paper

Neural Information Processing (ICONIP 2021)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1517)

Abstract

Techniques for speech separation have evolved rapidly in the last few years. Traditional recurrent neural networks (RNNs) have steadily been replaced by other architectures, such as convolutional neural networks (CNNs). Although these models have greatly improved speed and accuracy, they inevitably sacrifice some long-term dependency. As a result, the separated signals are vulnerable to being wrongly assigned to speakers. This situation is even more common when the mixed speech is sparse, as in ordinary conversation. In this paper, a two-stage training recipe with a restriction term based on the scale-invariant signal-to-noise ratio (SISNR) is put forward to prevent the wrong-assignment problem on sparsely mixed speech. The experiments are conducted on mixtures of the Japanese Newspaper Article Sentences (JNAS) corpus. They show that the proposed approach works effectively on sparse data (overlapping rate around 50%), improving separation performance accordingly. To test the applicability of speech separation in real situations, such as meeting transcription, the separation results are also evaluated by speech recognition; the character error rate is reduced by 10% compared to the baseline.
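To make the objective concrete, here is a minimal sketch, in PyTorch, of an utterance-level permutation-invariant SISNR loss for two speakers with a hypothetical restriction term that penalizes agreement with the wrong reference. The function names, the weight alpha, and the exact form of the penalty are illustrative assumptions; the paper's actual restriction term and its placement within the two-stage recipe are specified only in the full text.

import torch

def si_snr(est, ref, eps=1e-8):
    # Scale-invariant signal-to-noise ratio in dB for (batch, time) waveforms.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Scale-invariant target: project the estimate onto the reference.
    s_target = (est * ref).sum(-1, keepdim=True) * ref \
        / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    return 10 * torch.log10(
        s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)

def restricted_pit_loss(est1, est2, ref1, ref2, alpha=0.5):
    # Utterance-level permutation-invariant training (PIT): score both
    # possible speaker assignments and train on the better one.
    straight = si_snr(est1, ref1) + si_snr(est2, ref2)
    swapped = si_snr(est1, ref2) + si_snr(est2, ref1)
    best = torch.max(straight, swapped)
    worst = torch.min(straight, swapped)
    # Hypothetical restriction: also push down the SISNR of the wrong
    # assignment, discouraging channel swaps on sparsely overlapped
    # (mostly single-speaker) segments. The paper's exact term differs.
    return -(best - alpha * worst.clamp(min=0)).mean()

Minimizing this loss maximizes the SISNR of the better permutation, while the second term suppresses residual similarity to the wrong speaker, the failure mode that becomes frequent at low overlap.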

Acknowledgments

We used “ASJ Japanese Newspaper Article Sentences Read Speech Corpus” provided by Speech Resources Consortium, National Institute of Informatics.

Author information

Corresponding author

Correspondence to Shaoxiang Dang.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Dang, S., Matsumoto, T., Kudo, H., Takeuchi, Y. (2021). A Restriction Training Recipe for Speech Separation on Sparsely Mixed Speech. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Communications in Computer and Information Science, vol 1517. Springer, Cham. https://doi.org/10.1007/978-3-030-92310-5_85

  • DOI: https://doi.org/10.1007/978-3-030-92310-5_85

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-92309-9

  • Online ISBN: 978-3-030-92310-5

  • eBook Packages: Computer Science, Computer Science (R0)
