Abstract
Evaluating the performance of neural network-based text generators and density estimators is challenging, since no single measure perfectly captures language quality. Perplexity has been a mainstay metric for neural language models trained by maximizing the conditional log-likelihood. We argue that perplexity alone is a naive measure, since it does not explicitly account for the semantic similarity between generated and target sentences. It measures the cross-entropy between targets and predictions at the word level, treating all incorrect predictions as equally wrong, even those that are semantically similar to the target and globally coherent; near-miss candidate tokens therefore receive no credit. This is particularly important when learning from smaller corpora, where co-occurrence statistics are even sparser. This paper therefore proposes a pretrained model-based evaluation that assesses the semantic and syntactic similarity between predicted and target sequences. We argue that this improves over perplexity, which does not distinguish between incorrect predictions at varying semantic distances from the target words. We find that models that outperform others on perplexity on Penn Treebank and WikiText-2 do not necessarily perform better on measures that evaluate using semantic similarity.
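To make the contrast concrete, the sketch below is a minimal illustration, not the paper's implementation: the GPT-2 and BERT checkpoints, the example sentences, and the mean-pooling choice are our own assumptions. Perplexity scores a string only by exact-token likelihood, so it cannot tell whether a wrong prediction is semantically close to the target, whereas a pretrained-model embedding similarity still rewards a semantically equivalent paraphrase.

```python
# Illustrative sketch only; checkpoints, sentences, and pooling are assumptions.
import math

import torch
from transformers import AutoModel, AutoTokenizer, GPT2LMHeadModel, GPT2Tokenizer

lm_tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """exp of the mean word-level cross-entropy under the language model."""
    ids = lm_tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean negative log-likelihood
    return math.exp(loss.item())

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed(text: str) -> torch.Tensor:
    """Crude sentence embedding: mean-pooled last hidden states."""
    batch = bert_tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**batch).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1)

target = "the market fell sharply yesterday"
paraphrase = "the market dropped steeply yesterday"  # different tokens, same meaning

# Word-level cross-entropy sees only exact-token fit ...
print(f"PPL(target)     = {perplexity(target):.1f}")
print(f"PPL(paraphrase) = {perplexity(paraphrase):.1f}")

# ... whereas embedding similarity credits the semantic match.
sim = torch.nn.functional.cosine_similarity(embed(target), embed(paraphrase))
print(f"semantic similarity = {sim.item():.3f}")
```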
Notes
1. We follow the Hugging Face implementation available at https://github.com/huggingface/pytorch-openai-transformer-lm.
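As a hedged pointer, the snippet below is our own modern equivalent, not the paper's code: the footnote references the standalone pytorch-openai-transformer-lm repository, and the same GPT model is now also exposed through the Hugging Face `transformers` package.

```python
# Assumption: loading the OpenAI GPT model via the `transformers` package,
# as a stand-in for the standalone pytorch-openai-transformer-lm repository.
import torch
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt").eval()

ids = tokenizer("the meeting adjourned at noon", return_tensors="pt").input_ids
with torch.no_grad():
    loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
print("perplexity:", torch.exp(loss).item())
```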
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
O’Neill, J., Bollegala, D. (2020). Learning to Evaluate Neural Language Models. In: Nguyen, LM., Phan, XH., Hasida, K., Tojo, S. (eds) Computational Linguistics. PACLING 2019. Communications in Computer and Information Science, vol 1215. Springer, Singapore. https://doi.org/10.1007/978-981-15-6168-9_11
DOI: https://doi.org/10.1007/978-981-15-6168-9_11
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-6167-2
Online ISBN: 978-981-15-6168-9