
A Public Chinese Dataset for Language Model Adaptation

Journal of Signal Processing Systems

Abstract

A language model (LM) is an important component of a speech recognition system. The performance of an LM degrades when the domains of the training data and the test data differ, and language model adaptation aims to compensate for this mismatch. However, there has been no public Chinese dataset for evaluating language model adaptation. In this paper, we present CLMAD, a public Chinese dataset for language model adaptation. The dataset covers four domains: sport, stock, fashion, and finance. We evaluate the differences among the four domains and provide baselines for two commonly used adaptation techniques: interpolation for n-gram models and fine-tuning for recurrent neural network language models (RNNLMs). For n-gram interpolation, the adapted model improves when the source and target domains are relatively similar, but interpolating LMs from very different domains yields no improvement. For RNNLMs, fine-tuning the whole network achieves a larger improvement than fine-tuning only the softmax layer or the embedding layer, and the improvement of the adapted RNNLM is significant when the domain difference is large. We also report speech recognition results on AISHELL-1 with LMs trained on CLMAD. CLMAD can be freely downloaded at http://www.openslr.org/55/ .
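
As a rough illustration of the n-gram interpolation baseline mentioned above, the following minimal Python sketch mixes a source-domain LM with a target-domain LM and measures perplexity on a small target-domain word list. This is not the CLMAD baseline code: the toy unigram probability tables, the back-off floor, the weight values, and the function names are invented for the example, and real n-gram interpolation mixes conditional probabilities P(w | history) in the same way.

    # Minimal sketch of linear LM interpolation, assuming toy unigram tables:
    #   P_adapt(w) = lam * P_target(w) + (1 - lam) * P_source(w)
    import math

    FLOOR = 1e-10  # probability assigned to words unseen by a toy LM

    def interpolate(p_target, p_source, lam):
        """Mix two word-probability tables with weight `lam` on the target-domain LM."""
        vocab = set(p_target) | set(p_source)
        return {w: lam * p_target.get(w, FLOOR) + (1.0 - lam) * p_source.get(w, FLOOR)
                for w in vocab}

    def perplexity(model, words):
        """Per-word perplexity of a word sequence under a probability table."""
        log_prob = sum(math.log(model.get(w, FLOOR)) for w in words)
        return math.exp(-log_prob / len(words))

    if __name__ == "__main__":
        # Hypothetical source-domain (sport) and target-domain (stock) unigram LMs.
        p_sport = {"比赛": 0.45, "冠军": 0.35, "股票": 0.10, "上涨": 0.10}
        p_stock = {"股票": 0.40, "上涨": 0.40, "比赛": 0.10, "冠军": 0.10}
        stock_test = ["股票", "上涨", "股票"]
        for lam in (0.0, 0.5, 1.0):
            mixed = interpolate(p_stock, p_sport, lam)
            print(f"lambda={lam:.1f}  stock-domain ppl={perplexity(mixed, stock_test):.2f}")

In practice the interpolation weight would be tuned on a held-out target-domain set; the paper's n-gram baselines are built with standard language modeling toolkits, so this snippet is only meant to make the interpolation formula concrete.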


Notes

  1. http://www.openslr.org/55/

  2. http://thuctc.thunlp.org/

  3. http://www.sina.com


Acknowledgments

This work is supported by the National Key R&D Program of China (No. 2017YFB1002802). We thank the NLP Lab of Tsinghua University for providing the THUCNews corpus and Dr. Zhiyuan Liu for permitting us to extend the dataset. We also thank the anonymous reviewers for their invaluable comments.

Author information

Corresponding author

Correspondence to Jiangyan Yi.


Appendix

Table 18 Perplexities between every two domains of data. Each row shows the perplexities of a trigram model on each test set. The part in grey shading is the same as Table 4.
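
Purely to make the layout of such a pairwise table concrete, here is a minimal Python sketch in which each toy unigram LM is scored on every domain's test word list. The models and test texts are invented; the paper's table is built from trigram models trained on the actual CLMAD domains. In-domain pairs yield low perplexities and cross-domain pairs much higher ones.

    # Sketch of a domain-by-domain perplexity comparison. All numbers are invented.
    import math

    FLOOR = 1e-10  # probability assigned to words unseen by a toy LM

    def perplexity(model, words):
        """Per-word perplexity of a word sequence under a probability table."""
        log_prob = sum(math.log(model.get(w, FLOOR)) for w in words)
        return math.exp(-log_prob / len(words))

    models = {
        "sport": {"比赛": 0.5, "冠军": 0.3, "球队": 0.2},
        "stock": {"股票": 0.5, "上涨": 0.3, "下跌": 0.2},
    }
    test_sets = {
        "sport": ["比赛", "冠军", "球队"],
        "stock": ["股票", "上涨", "下跌"],
    }

    for train_dom, lm in models.items():
        for test_dom, words in test_sets.items():
            print(f"LM trained on {train_dom}, tested on {test_dom}: "
                  f"ppl = {perplexity(lm, words):.1f}")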

About this article

Cite this article

Bai, Y., Yi, J., Tao, J. et al. A Public Chinese Dataset for Language Model Adaptation. J Sign Process Syst 92, 839–851 (2020). https://doi.org/10.1007/s11265-019-01482-5

