Abstract
We propose a novel speaker-dependent (SD) multi-condition (MC) training approach for jointly learning deep neural network (DNN) acoustic models and an explicit DNN-based speech separation front end for recognizing multi-talker mixed speech in a single-channel setting. First, an MC acoustic modeling framework is established to train an SD-DNN model in multi-talker scenarios. Assuming the speaker identities in the mixed speech are known, such a recognizer significantly reduces decoding complexity and improves recognition accuracy over recognizers that use speaker-independent DNN models with a complicated joint decoding structure. In addition, an SD regression DNN that maps the acoustic features of the mixed speech to the speech features of a target speaker is trained jointly with the SD-DNN acoustic models. Experimental results on the small-vocabulary Speech Separation Challenge (SSC) task show that the proposed approach under multi-condition training achieves an average word error rate (WER) of 3.8%, a relative WER reduction of 65.1% from the top-performing, DNN-based pre-processing-only approach we proposed earlier under clean-condition training (Tu et al. 2016). Furthermore, the proposed joint training framework yields a relative WER reduction of 13.2% over state-of-the-art systems under multi-condition training. Finally, the effectiveness of the proposed approach is also verified on medium-vocabulary continuous speech recognition with the Wall Street Journal (WSJ0) corpus in a simulated multi-talker setting.
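To make the joint training the abstract outlines more concrete, below is a minimal PyTorch sketch of the two-stage pipeline: an SD regression DNN that estimates target-speaker features from mixed-speech features, feeding an SD-DNN acoustic model, with a combined regression-plus-classification loss back-propagated through both networks. All layer sizes, the sigmoid activations, the interpolation weight alpha, and names such as SeparationDNN are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class SeparationDNN(nn.Module):
    """SD regression DNN: mixed-speech features -> target-speaker features."""
    def __init__(self, feat_dim=40, context=7, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim * context, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, feat_dim),          # estimated target-speaker features
        )

    def forward(self, mixed):
        return self.net(mixed)

class AcousticDNN(nn.Module):
    """SD-DNN acoustic model: separated features -> senone posteriors."""
    def __init__(self, feat_dim=40, hidden=2048, num_senones=2000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, num_senones),       # logits over tied HMM states
        )

    def forward(self, feats):
        return self.net(feats)

sep, am = SeparationDNN(), AcousticDNN()
mse, ce = nn.MSELoss(), nn.CrossEntropyLoss()
opt = torch.optim.SGD(list(sep.parameters()) + list(am.parameters()), lr=1e-3)

def joint_step(mixed, clean_target, senone_labels, alpha=0.5):
    """One joint-training step: regression loss on the separation front end
    plus cross-entropy on the acoustic model, back-propagated through both."""
    est = sep(mixed)                              # separated target-speaker features
    loss = alpha * mse(est, clean_target) + ce(am(est), senone_labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

The key design point the sketch illustrates is that the separation network is not trained in isolation: the recognition loss also flows through it, so the front end learns to produce features that are useful for the acoustic model rather than merely close to the clean target.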
References
Tu, Y., Du, J., Dai, L., & Lee, C. (2016). In Proc. ISCSLP.
Radfar, M.H., & Dansereau, R.M. (2007). IEEE Transactions on Audio, Speech, and Language Processing, 15(8), 2299.
Cooke, M., Hershey, J.R., & Rennie, S.J. (2010). Computer Speech and Language, 24(1), 1.
Kristjansson, T.T., Hershey, J.R., Olsen, P.A., Rennie, S.J., & Gopinath, R.A. (2006). In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH).
Virtanen, T. (2006). In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH).
Weiss, R.J., & Ellis, D.P.W. (2007). In Proc. IEEE workshop on applications of signal processing to audio and acoustics (WASPAA) (pp. 114–117).
Ghahramani, Z., & Jordan, M.I. (1997). Machine Learning, 29, 245.
Schmidt, M.N., & Olsson, R. (2006). In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH).
Jackson, E.P. (2006). In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH).
Wang, D., & Brown, G.J. (2006). Journal of the Acoustical Society of America, 124(1), 13.
Barker, J., Ma, N., Coy, A., & Cooke, M. (2010). Computer Speech and Language, 24(1), 94.
Ming, J., Hazen, T.J., & Glass, J.R. (2010). Computer Speech and Language, 24(1), 67.
Shao, Y., Srinivasan, S., Jin, Z., & Wang, D. (2010). Computer Speech and Language, 24(1), 77.
Reynolds, D.A., & Rose, R.C. (1995). IEEE Transactions on Speech and Audio Processing, 3(1), 72.
Hinton, G.E., & Salakhutdinov, R. (2006). Science, 313(5786), 504.
Hinton, G.E., Osindero, S., & Teh, Y. (2006). Neural Computation, 18(7), 1527.
Dahl, G.E., Yu, D., Deng, L., & Acero, A. (2012). IEEE Transactions on Audio, Speech, and Language Processing.
Mohamed, A., Dahl, G.E., & Hinton, G.E. (2012). IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 14.
Hinton, G.E., Deng, L., Yu, D., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A.W., Vanhoucke, V., Nguyen, P., Sainath, T.N., et al. (2012). IEEE Signal Processing Magazine, 29(6), 82.
Du, J., Tu, Y., Dai, L., & Lee, C. (2016). IEEE Transactions on Audio, Speech, and Language Processing, 24(8), 1424.
Huang, P., Kim, M., Hasegawajohnson, M., & Smaragdis, P. (2015). IEEE Transactions on Audio, Speech, and Language Processing, 23(12), 2136.
Zohrer, M., Peharz, R., & Pernkopf, F. (2015). IEEE Transactions on Audio, Speech, and Language Processing, 23(12), 2398.
Weng, C., Yu, D., Seltzer, M.L., & Droppo, J. (2015). IEEE Transactions on Audio, Speech, and Language Processing, 23(10), 1670.
Mohri, M., Pereira, F., & Riley, M.P. (2002). Computer Speech and Language, 16(1), 69.
Nadeu, C., Macho, D., & Hernando, J. (2000). Speech Communication, 34(1), 93.
Paul, D.B., & Baker, J.M. (1992). In Proc. 5th DARPA speech and natural lang. workshop (pp. 357–362).
Tu, Y., Du, J., Dai, L., & Lee, C. (2015). In Proc. ICASSP (pp. 61–65).
Yu, D., Seltzer, M.L., Li, J., & Seide, F. (2013). CoRR, abs/1301.
Zhang, Y., & Glass, J.R. (2009). In Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
Wilpon, J.G., Lee, C.H., & Rabiner, L.R. (1989). In Proc. ICASSP (pp. 254–257).
Tu, Y., Du, J., Dai, L., & Lee, C. (2015). In Proc. ICSP (pp. 532–536).
Gales, M. (1998). Computer Speech and Language, 12(2), 75.
Hu, Y., & Huo, Q. (2007). In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH).
Bengio, Y. (2009). Foundations and Trends in Machine Learning, 2(1), 1.
Ephraim, Y., & Malah, D. (1984). IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(6), 1109.
Seide, F., Li, G., Chen, X., & Yu, D. (2011). In Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
De Boer, P., Kroese, D.P., Mannor, S., & Rubinstein, R.Y. (2005). Annals of Operations Research, 134(1), 19.
Cooke, M., & Lee, T.W. (2016). http://staffwww.dcs.shef.ac.uk/people/M.Cooke/SpeechSeparationChallenge.htm.
Cooke, M., Barker, J., Cunningham, S., & Shao, X. (2006). Journal of the Acoustical Society of America, 120(5), 2421.
Xu, Y., Du, J., Dai, L., & Lee, C. (2014). IEEE Signal Processing Letters, 21(1), 65.
Hinton, G.E. (2010). A practical guide to training restricted Boltzmann machines. University of Toronto.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant No. 61671422.
Cite this article
Tu, YH., Du, J. & Lee, CH. A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech. J Sign Process Syst 90, 963–973 (2018). https://doi.org/10.1007/s11265-017-1295-x