A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech

Published in: Journal of Signal Processing Systems

Abstract

We propose a novel speaker-dependent (SD), multi-condition (MC) training approach to the joint learning of deep neural network (DNN) acoustic models and an explicit speech-separation structure for recognizing multi-talker mixed speech in a single-channel setting. First, an MC acoustic-modeling framework is established to train an SD-DNN model in multi-talker scenarios. Assuming the speaker identities in the mixed speech are known, such a recognizer significantly reduces decoding complexity and improves recognition accuracy over systems that use speaker-independent DNN models with a complicated joint decoding structure. In addition, an SD regression DNN that maps the acoustic features of mixed speech to the speech features of a target speaker is jointly trained with the SD-DNN acoustic models. Experimental results on the Speech Separation Challenge (SSC) small-vocabulary recognition task show that the proposed approach under multi-condition training achieves an average word error rate (WER) of 3.8%, a relative WER reduction of 65.1% from a top-performing, DNN-based pre-processing-only approach we proposed earlier under clean-condition training (Tu et al. 2016). Furthermore, the proposed joint-training DNN framework yields a relative WER reduction of 13.2% from state-of-the-art systems under multi-condition training. Finally, the effectiveness of the proposed approach is also verified on the Wall Street Journal (WSJ0) medium-vocabulary continuous speech recognition task in a simulated multi-talker setting.
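The joint objective described above can be sketched as follows: a regression DNN estimates target-speaker features from mixed-speech features, an acoustic-model DNN then classifies those estimates into senone posteriors, and the two are tied by a combined loss (cross-entropy plus a weighted regression term). This is a minimal NumPy illustration of that combined forward pass and loss only; all dimensions, the hidden sizes, and the weighting `alpha` are illustrative assumptions, not the paper's exact configuration, and a real system would backpropagate through both networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not the paper's exact setup):
# 40-dim features, 11-frame input context, 500 senone targets.
feat_dim, ctx, n_senones = 40, 11, 500
in_dim = feat_dim * ctx

def mlp_forward(x, W1, b1, W2, b2):
    """One sigmoid hidden layer, linear output."""
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))
    return h @ W2 + b2

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Separation (regression) DNN: mixed-speech features -> target-speaker features.
W1s = rng.normal(0, 0.01, (in_dim, 256)); b1s = np.zeros(256)
W2s = rng.normal(0, 0.01, (256, feat_dim)); b2s = np.zeros(feat_dim)

# Acoustic-model DNN: separated features -> senone posteriors
# (context expansion of the separated features is omitted for brevity).
W1a = rng.normal(0, 0.01, (feat_dim, 256)); b1a = np.zeros(256)
W2a = rng.normal(0, 0.01, (256, n_senones)); b2a = np.zeros(n_senones)

batch = 8
mixed = rng.normal(size=(batch, in_dim))         # mixed-speech input features
clean = rng.normal(size=(batch, feat_dim))       # target-speaker reference features
labels = rng.integers(0, n_senones, size=batch)  # frame-level senone labels

sep_out = mlp_forward(mixed, W1s, b1s, W2s, b2s)        # estimated clean features
post = softmax(mlp_forward(sep_out, W1a, b1a, W2a, b2a))

mse = np.mean((sep_out - clean) ** 2)                   # separation (regression) loss
ce = -np.mean(np.log(post[np.arange(batch), labels]))   # acoustic-model loss
alpha = 0.1                                             # illustrative interpolation weight
joint_loss = ce + alpha * mse
```

Because the separated features feed directly into the acoustic model, gradients of the cross-entropy term would flow back into the separation DNN during joint training, which is what couples the two components.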


References

  1. Tu, Y., Du, J., Dai, L., & Lee, C. (2016). In Proc. ISCSLP.

  2. Radfar, M.H., & Dansereau, R.M. (2007). IEEE Transactions on Audio, Speech, and Language Processing, 15(8), 2299.


  3. Cooke, M., Hershey, J.R., & Rennie, S.J. (2010). Computer Speech and Language, 24(1), 1.


  4. Kristjansson, T.T., Hershey, J.R., Olsen, P.A., Rennie, S.J., & Gopinath, R.A. (2006). In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH).

  5. Virtanen, T. (2006). In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH).

  6. Weiss, R.J., & Ellis, D.P.W. (2007). In Proc. IEEE workshop on applications of signal processing to audio and acoustics (WASPAA) (pp. 114–117).

  7. Ghahramani, Z., & Jordan, M.I. (1997). Machine Learning, 29, 245.


  8. Schmidt, M.N., & Olsson, R. (2006). In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH).

  9. Jackson, E.P. (2006). In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH).

  10. Wang, D., & Brown, G.J. (2006). Journal of the Acoustical Society of America, 124(1), 13.


  11. Barker, J., Ma, N., Coy, A., & Cooke, M. (2010). Computer Speech and Language, 24(1), 94.


  12. Ming, J., Hazen, T.J., & Glass, J.R. (2010). Computer Speech and Language, 24(1), 67.


  13. Shao, Y., Srinivasan, S., Jin, Z., & Wang, D. (2010). Computer Speech and Language, 24(1), 77.


  14. Reynolds, D.A., & Rose, R.C. (1995). IEEE Transactions on Speech and Audio Processing, 3(1), 72.


  15. Hinton, G.E., & Salakhutdinov, R. (2006). Science, 313(5786), 504.


  16. Hinton, G.E., Osindero, S., & Teh, Y. (2006). Neural Computation, 18(7), 1527.


  17. Dahl, G.E., Yu, D., Deng, L., & Acero, A. (2012). IEEE Transactions on Audio, Speech, and Language Processing.

  18. Mohamed, A., Dahl, G.E., & Hinton, G.E. (2012). IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 14.


  19. Hinton, G.E., Deng, L., Yu, D., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A.W., Vanhoucke, V., Nguyen, P., Sainath, T.N., & et al. (2012). IEEE Signal Processing Magazine, 29(6), 82.


  20. Du, J., Tu, Y., Dai, L., & Lee, C. (2016). IEEE Transactions on Audio, Speech, and Language Processing, 24(8), 1424.


  21. Huang, P., Kim, M., Hasegawa-Johnson, M., & Smaragdis, P. (2015). IEEE Transactions on Audio, Speech, and Language Processing, 23(12), 2136.


  22. Zohrer, M., Peharz, R., & Pernkopf, F. (2015). IEEE Transactions on Audio, Speech, and Language Processing, 23(12), 2398.


  23. Weng, C., Yu, D., Seltzer, M.L., & Droppo, J. (2015). IEEE Transactions on Audio, Speech, and Language Processing, 23(10), 1670.


  24. Mohri, M., Pereira, F., & Riley, M.P. (2002). Computer Speech and Language, 16(1), 69.


  25. Nadeu, C., Macho, D., & Hernando, J. (2000). Speech Communication, 34(1), 93.


  26. Paul, D.B., & Baker, J.M. (1992). In Proc. 5th DARPA speech and natural lang. workshop (pp. 357–362).

  27. Tu, Y., Du, J., Dai, L., & Lee, C. (2015). In Proc. ICASSP (pp. 61–65).

  28. Yu, D., Seltzer, M.L., Li, J., & Seide, F. (2013). CoRR, vol. 1301.

  29. Zhang, Y., & Glass, J.R. (2009). In Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

  30. Wilpon, J.G., Lee, C.H., & Rabiner, L.R. (1989). In Proc. ICASSP (pp. 254–257).

  31. Tu, Y., Du, J., Dai, L., & Lee, C. (2015). In Proc. ICSP (pp. 532–536).

  32. Gales, M. (1998). Computer Speech and Language, 12(2), 75.


  33. Hu, Y., & Huo, Q. (2007). In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH).

  34. Bengio, Y. (2009). Foundations and Trends in Machine Learning, 2(1), 1.


  35. Ephraim, Y., & Malah, D. (1984). IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(6), 1109.


  36. Seide, F., Li, G., Chen, X., & Yu, D. (2011). In Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

  37. De Boer, P., Kroese, D.P., Mannor, S., & Rubinstein, R.Y. (2005). Annals of Operations Research, 134(1), 19.


  38. Cooke, M., & Lee, T.W. (2016). http://staffwww.dcs.shef.ac.uk/people/M.Cooke/SpeechSeparationChallenge.htm.

  39. Cooke, M., Barker, J., Cunningham, S., & Shao, X. (2006). Journal of the Acoustical Society of America, 120(5), 2421.


  40. Xu, Y., Du, J., Dai, L., & Lee, C. (2014). IEEE Signal Processing Letters, 21(1), 65.


  41. Hinton, G.E. (2010). A practical guide to training restricted Boltzmann machines. University of Toronto.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant No. 61671422.

Corresponding author

Correspondence to Jun Du.

About this article


Cite this article

Tu, YH., Du, J. & Lee, CH. A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech. J Sign Process Syst 90, 963–973 (2018). https://doi.org/10.1007/s11265-017-1295-x

