
A Public Chinese Dataset for Language Model Adaptation

Journal of Signal Processing Systems

Abstract

A language model (LM) is an important component of a speech recognition system. The performance of an LM degrades when the domains of the training data and the test data differ, and language model adaptation aims to compensate for this mismatch. However, there has been no public Chinese dataset for evaluating language model adaptation. In this paper, we present CLMAD, a public Chinese dataset for language model adaptation. The dataset covers four domains: sport, stock, fashion, and finance. We evaluate the differences among the four domains and provide baselines for two commonly used adaptation techniques: interpolation for n-gram models and fine-tuning for recurrent neural network language models (RNNLMs). For n-gram interpolation, the adapted model improves when the source and target domains are relatively similar, but interpolating LMs from very different domains yields no improvement. For RNNLMs, fine-tuning the whole network achieves a larger improvement than fine-tuning only the softmax layer or the embedding layer, and the improvement of the adapted RNNLM is significant when the domain difference is large. We also report speech recognition results on AISHELL-1 with LMs trained on CLMAD. CLMAD can be freely downloaded at http://www.openslr.org/55/ .
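
As a rough illustration of the n-gram interpolation baseline mentioned above, the following minimal Python sketch mixes a source-domain LM with a target-domain LM and measures perplexity on a small target-domain word list. This is not the CLMAD baseline code: the toy unigram probability tables, the back-off floor, the weight values, and the function names are invented for the example, and real n-gram interpolation mixes conditional probabilities P(w | history) in the same way.

    # Minimal sketch of linear LM interpolation, assuming toy unigram tables:
    #   P_adapt(w) = lam * P_target(w) + (1 - lam) * P_source(w)
    import math

    FLOOR = 1e-10  # probability assigned to words unseen by a toy LM

    def interpolate(p_target, p_source, lam):
        """Mix two word-probability tables with weight `lam` on the target-domain LM."""
        vocab = set(p_target) | set(p_source)
        return {w: lam * p_target.get(w, FLOOR) + (1.0 - lam) * p_source.get(w, FLOOR)
                for w in vocab}

    def perplexity(model, words):
        """Per-word perplexity of a word sequence under a probability table."""
        log_prob = sum(math.log(model.get(w, FLOOR)) for w in words)
        return math.exp(-log_prob / len(words))

    if __name__ == "__main__":
        # Hypothetical source-domain (sport) and target-domain (stock) unigram LMs.
        p_sport = {"比赛": 0.45, "冠军": 0.35, "股票": 0.10, "上涨": 0.10}
        p_stock = {"股票": 0.40, "上涨": 0.40, "比赛": 0.10, "冠军": 0.10}
        stock_test = ["股票", "上涨", "股票"]
        for lam in (0.0, 0.5, 1.0):
            mixed = interpolate(p_stock, p_sport, lam)
            print(f"lambda={lam:.1f}  stock-domain ppl={perplexity(mixed, stock_test):.2f}")

In practice the interpolation weight would be tuned on a held-out target-domain set; the paper's n-gram baselines are built with standard language modeling toolkits, so this snippet is only meant to make the interpolation formula concrete.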


Notes

  1. http://www.openslr.org/55/

  2. http://thuctc.thunlp.org/

  3. http://www.sina.com


Acknowledgments

This work is supported by the National Key R&D Program of China (No. 2017YFB1002802). We thank the NLP Lab of Tsinghua University for providing the THUCNews corpus and Dr. Zhiyuan Liu for permitting us to extend the dataset. We also thank the anonymous reviewers for their invaluable comments.

Author information

Corresponding author

Correspondence to Jiangyan Yi.


Appendix

Table 18 Perplexities between every two domains of data. Each row shows the perplexities of a trigram model on each test set. The part in grey shading is the same as Table 4.
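
Purely to make the layout of such a pairwise table concrete, here is a minimal Python sketch in which each toy unigram LM is scored on every domain's test word list. The models and test texts are invented; the paper's table is built from trigram models trained on the actual CLMAD domains. In-domain pairs yield low perplexities and cross-domain pairs much higher ones.

    # Sketch of a domain-by-domain perplexity comparison. All numbers are invented.
    import math

    FLOOR = 1e-10  # probability assigned to words unseen by a toy LM

    def perplexity(model, words):
        """Per-word perplexity of a word sequence under a probability table."""
        log_prob = sum(math.log(model.get(w, FLOOR)) for w in words)
        return math.exp(-log_prob / len(words))

    models = {
        "sport": {"比赛": 0.5, "冠军": 0.3, "球队": 0.2},
        "stock": {"股票": 0.5, "上涨": 0.3, "下跌": 0.2},
    }
    test_sets = {
        "sport": ["比赛", "冠军", "球队"],
        "stock": ["股票", "上涨", "下跌"],
    }

    for train_dom, lm in models.items():
        for test_dom, words in test_sets.items():
            print(f"LM trained on {train_dom}, tested on {test_dom}: "
                  f"ppl = {perplexity(lm, words):.1f}")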

About this article

Cite this article

Bai, Y., Yi, J., Tao, J. et al. A Public Chinese Dataset for Language Model Adaptation. J Sign Process Syst 92, 839–851 (2020). https://doi.org/10.1007/s11265-019-01482-5

