Abstract
In this chapter we explore sequence-discriminative training techniques for neural-network–hidden-Markov-model (NN-HMM) hybrid speech recognition systems. We first review sequence-discriminative training criteria for these systems, including maximum mutual information (MMI), boosted MMI, minimum phone error (MPE), and state-level minimum Bayes risk (sMBR). We then focus on the sMBR criterion and demonstrate a few heuristics, such as the choice of denominator language model order and frame-smoothing, that may improve recognition performance. We further propose a two-forward-pass procedure to speed up sequence-discriminative training when memory is the main constraint. Experiments were conducted on the AMI meeting corpus.
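As a rough illustration of the frame-smoothing heuristic named above, the sketch below interpolates a frame-level cross-entropy gradient with a sequence-level gradient. It assumes the common formulation in which the smoothed objective is a weighted sum of the two criteria. The function names, the weight `seq_weight`, and the toy numbers are hypothetical and are not taken from the chapter.

```python
def cross_entropy_grad(posteriors, labels):
    """Gradient of frame-level cross-entropy w.r.t. pre-softmax activations.

    posteriors: list of per-frame softmax outputs (each a list of floats).
    labels: list of reference state indices, one per frame.
    """
    grads = []
    for post, label in zip(posteriors, labels):
        grads.append([p - (1.0 if k == label else 0.0)
                      for k, p in enumerate(post)])
    return grads


def frame_smoothed_grad(ce_grad, seq_grad, seq_weight):
    """Interpolate the two gradients frame by frame.

    Assumed form: (1 - seq_weight) * CE gradient + seq_weight * sequence gradient.
    """
    if not 0.0 <= seq_weight <= 1.0:
        raise ValueError("seq_weight must lie in [0, 1]")
    return [[(1.0 - seq_weight) * c + seq_weight * s
             for c, s in zip(ce_row, seq_row)]
            for ce_row, seq_row in zip(ce_grad, seq_grad)]


# Toy example: 2 frames, 3 states. The sequence-level gradient would normally
# come from lattice statistics; here it is a made-up placeholder.
posteriors = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
labels = [0, 1]
seq_grad = [[-0.2, 0.1, 0.1], [0.0, -0.3, 0.3]]

ce_grad = cross_entropy_grad(posteriors, labels)
smoothed = frame_smoothed_grad(ce_grad, seq_grad, seq_weight=0.8)
```

Setting `seq_weight` to 1.0 recovers the pure sequence-level gradient, and 0.0 recovers plain cross-entropy training. Intermediate values keep some frame-level supervision in every update.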
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Chen, G., Zhang, Y., Yu, D. (2017). Sequence-Discriminative Training of Neural Networks. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_12
DOI: https://doi.org/10.1007/978-3-319-64680-0_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0
eBook Packages: Computer Science, Computer Science (R0)