Abstract
With the development of deep learning, speech recognition based on deep neural networks has improved continuously in recent years. However, the performance of minority-language speech recognition still cannot compare with that of majority languages, whose data can be collected and transcribed relatively easily. We therefore attempt to devise an effective method for sharing data across languages to improve minority-language speech recognition. We propose a speech attribute detector model under an end-to-end framework, and then use the detector to extract features for minority-language speech recognition. To the best of our knowledge, this is the first end-to-end model for extracting distinctive features. We conducted experiments on Tibetan and Mandarin. The results show that significant improvements in Tibetan phoneme recognition were achieved by utilizing the Mandarin data.
Notes
- 1.
Distinctive features are a set of distinguishing attributes, summarized by linguists to differentiate phonemes, that reflect the different states of the speech organs.
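The footnote above treats distinctive features as a language-universal description of phonemes. A toy sketch of why this enables cross-language data sharing is given below; the attribute inventory and the phoneme-to-attribute assignments are illustrative assumptions, not the feature set used in the paper:

```python
# Toy sketch: distinctive features as a shared attribute space across languages.
# Attribute inventory and phoneme assignments below are made up for illustration.

ATTRIBUTES = ["voiced", "nasal", "fricative", "stop", "aspirated", "high", "round"]

def attr_vector(active):
    """Binary vector over the shared attribute inventory."""
    return [1 if a in active else 0 for a in ATTRIBUTES]

# Hypothetical assignments for a few Mandarin and Tibetan phonemes.
MANDARIN = {
    "b": attr_vector({"stop", "voiced"}),
    "p": attr_vector({"stop", "aspirated"}),
    "m": attr_vector({"nasal", "voiced"}),
}
TIBETAN = {
    "pa":  attr_vector({"stop"}),
    "pha": attr_vector({"stop", "aspirated"}),
    "ma":  attr_vector({"nasal", "voiced"}),
}

def shared_dims(v1, v2):
    """Number of attribute dimensions on which two phonemes agree."""
    return sum(1 for a, b in zip(v1, v2) if a == b)
```

Because both phoneme inventories are described in the same attribute space, an attribute detector trained on abundant Mandarin speech produces features that remain meaningful for scarce Tibetan speech; e.g. `shared_dims(MANDARIN["m"], TIBETAN["ma"])` equals the full dimensionality, since the two nasals carry identical attribute vectors in this toy setup.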
Acknowledgments
This work is supported in part by the National Natural Science Foundation of China (No. 11590773, No. U1713217) and the Key Program of the National Social Science Foundation of China (No. 12&ZD119).
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Fu, T., Gao, S., Wu, X. (2018). Improving Minority Language Speech Recognition Based on Distinctive Features. In: Peng, Y., Yu, K., Lu, J., Jiang, X. (eds) Intelligence Science and Big Data Engineering. IScIDE 2018. Lecture Notes in Computer Science(), vol 11266. Springer, Cham. https://doi.org/10.1007/978-3-030-02698-1_36
DOI: https://doi.org/10.1007/978-3-030-02698-1_36
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02697-4
Online ISBN: 978-3-030-02698-1
eBook Packages: Computer Science, Computer Science (R0)