Neurocomputing

Volume 269, 20 December 2017, Pages 73-81

Tunable discounting and visual exploration for language models

https://doi.org/10.1016/j.neucom.2016.08.145

Abstract

A language model is fundamental to many applications in natural language processing. Most language models are trained on a large dataset and are difficult to adapt to other domains for which only a small dataset may be available. Tuning the discounting parameters used for smoothing is one way to adapt language models to a new domain. In this work, we present novel language models based on tunable discounting mechanisms. The language models are trained on a large dataset, but their discounting parameters can be tuned to a target dataset afterwards. We explore tunable discounting and polynomial discounting functions based on the modified Kneser–Ney (mKN) model. Specifically, we propose the tunable mKN (TmKN) model, the polynomial discounting mKN (PmKN) model, and the tunable and polynomial discounting mKN (TPmKN) model. We evaluate our proposed models against the mKN model, the improved KN model, and the tunable mKN with interpolation model (mKN + interp). Our language models achieve perplexity improvements in both in-domain and out-of-domain evaluation. Experimental results indicate that our new models significantly outperform the baseline model and are especially suitable for adaptation to new domains. In addition, we use visualization techniques to depict the relationship between parameter settings and language model performance, which guides our parameter optimization process. Exploratory visual analysis is then used to examine the performance of the proposed language models, revealing their strengths and characteristics.

Introduction

Language modeling is a well-studied topic in natural language processing (NLP), since language models play a role in many language technology tasks such as speech recognition [1], information retrieval [2] and machine translation [3], [4]. A language model assigns probabilities to sequences of n words. One of the most popular language models is the modified Kneser–Ney model [5], which has been implemented in language model toolkits such as SRILM [6] and KenLM [7].

The simplest way to compute a sequence probability is the maximum likelihood estimate (MLE). Unfortunately, the MLE overestimates the probability of rare events and assigns zero probability to unseen word sequences. As a result, smoothing techniques have been proposed to address this estimation problem. Some previous language models [5], [8] use absolute discounting and Kneser–Ney discounting to perform smoothing. These models subtract fixed discounts estimated on the training data. Given the mismatch between the training and test data (which may even come from different domains), an accurate estimate is difficult to achieve with knowledge of the training data alone.
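
The contrast between MLE and a fixed absolute discount can be made concrete with a small sketch. The toy corpus, the discount value D = 0.75 and the uniform redistribution of the reserved mass are illustrative assumptions; Kneser–Ney discounting would instead back off to a continuation distribution.

```python
# Toy comparison of MLE and absolute discounting for bigram probabilities.
# The corpus and discount D are illustrative; this is not the paper's code.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus[:-1])   # history counts
vocab = set(corpus)

def p_mle(w, prev):
    """MLE: relative frequency; zero for any unseen bigram."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

def p_abs(w, prev, D=0.75):
    """Absolute discounting: subtract a fixed D from each seen bigram count
    and redistribute the reserved mass (here uniformly, for brevity)."""
    seen_types = sum(1 for (h, _) in bigram_counts if h == prev)
    reserved = D * seen_types / unigram_counts[prev]
    return max(bigram_counts[(prev, w)] - D, 0) / unigram_counts[prev] \
        + reserved / len(vocab)

print(p_mle("ate", "mat"), p_abs("ate", "mat"))   # 0.0 vs. a small positive value
```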

In this study, we explore novel tunable discounting mechanisms that adjust language models to a target dataset. We train the language model on a training dataset as usual but tune the discounting parameters on a validation set, which we assume to be beneficial for domain adaptation. Compared with previous discounting methods, our tunable discounting and polynomial discounting techniques offer a more flexible approach for adapting language models to new domains. Our approach tunes the discounting parameters so as to minimize the perplexity [9] of the validation dataset.
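
As a sketch of the tuning idea, and under the assumption that the trained model exposes its discount as a free parameter and that validation perplexity can be evaluated for any candidate value (neither the helper nor the grid bounds come from the paper), the optimization can be as simple as a grid search:

```python
# Grid-search sketch: keep the trained counts fixed and pick the discount
# that minimizes validation perplexity. `val_perplexity` is a hypothetical
# callable supplied by the caller; the grid bounds are arbitrary.
def tune_discount(val_perplexity, grid=None):
    grid = grid or [i / 20 for i in range(1, 20)]   # candidate discounts 0.05 .. 0.95
    best_d = min(grid, key=val_perplexity)
    return best_d, val_perplexity(best_d)

# Toy usage with a stand-in for the real evaluation on validation data.
fake_ppl = lambda d: 120.0 + (d - 0.7) ** 2 * 50
print(tune_discount(fake_ppl))   # picks the grid point closest to 0.7
```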

In our experiments, we use the well-known modified Kneser–Ney smoothing model (mKN) as the baseline. The performance of all models is evaluated using perplexity on in-domain and out-of-domain data. In-domain, our models obtain lower perplexity scores than the baseline model. Out-of-domain, our language models achieve significantly better perplexity than the competitor. Experimental results demonstrate that our tunable discounting models outperform the mKN model. We expect the improvements of our models can further benefit related NLP applications.


Related work

Many researchers have investigated and proposed language models for natural language processing. In [8], the authors improved on back-off smoothing such as Katz smoothing and thus established the popular Kneser–Ney model. The modified Kneser–Ney model of [5] is the current dominant n-gram language model; its authors noted that interpolation generally works better than back-off. In [10] the authors contributed Stupid Backoff for language modeling. Their proposed model is slightly

Language models

In this section, we recall the commonly used modified Kneser–Ney (mKN) model [5], which serves as our contrastive model. We then present new variations of the mKN model with novel discounting techniques, including tunable discounts and polynomial discounts. The notation used throughout the paper is shown in Table 1.
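
For reference, the fixed discounts of the standard mKN model are commonly estimated from counts of counts using the Chen–Goodman formulas; the sketch below shows that estimate, which is the starting point our tunable variants relax. It is a minimal illustration, not the authors' implementation.

```python
# Standard mKN discount estimates from counts of counts:
#   Y  = n1 / (n1 + 2*n2)
#   D1 = 1 - 2*Y*n2/n1,  D2 = 2 - 3*Y*n3/n2,  D3+ = 3 - 4*Y*n4/n3
# where n_k is the number of distinct n-grams seen exactly k times.
from collections import Counter

def mkn_discounts(ngram_counts):
    n = Counter(ngram_counts.values())          # counts of counts
    n1, n2, n3, n4 = n[1], n[2], n[3], n[4]     # assumed non-zero on real corpora
    Y = n1 / (n1 + 2 * n2)
    return (1 - 2 * Y * n2 / n1,
            2 - 3 * Y * n3 / n2,
            3 - 4 * Y * n4 / n3)

# Toy usage on made-up trigram counts.
print(mkn_discounts({"a b c": 1, "b c d": 1, "c d e": 2, "d e f": 2,
                     "e f g": 3, "f g h": 4}))
```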

The optimization of parameters

As already mentioned, we use a tuning step with a small amount of validation data to tune the parameters of our language models. In this section, we introduce the steps of the parameter optimization. In our tuning steps, perplexity is used to measure the performance of a language model (LM). In language modeling [9], perplexity is the inverse probability of the test set, normalized by the number of words. Let $w_1 w_2 \ldots w_n$ be a test sentence s. We assume that a language model P estimates the
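
The perplexity used in our tuning steps can be computed as in the sketch below, assuming the language model is available as a callable p(word, history); the uniform toy model is only there to make the example runnable.

```python
# Perplexity = inverse probability of the test data, normalized by the
# number of words; equivalently exp of the average negative log-probability.
import math

def perplexity(p, sentences):
    log_prob, n_words = 0.0, 0
    for sent in sentences:
        for i, word in enumerate(sent):
            log_prob += math.log(p(word, sent[:i]))
            n_words += 1
    return math.exp(-log_prob / n_words)

# A uniform model over a 10-word vocabulary has perplexity 10.
print(perplexity(lambda w, h: 0.1, [["a", "b", "c"], ["d", "e"]]))
```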

Corpora

We summarize the corpora used in our experiments in Table 2. For the in-domain experiments, the training, validation (Val) and test data are taken from the Wall Street Journal (WSJ) corpus. The training set contains more than 1.6 million sentences, and the validation and test sets each have roughly 100,000 sentences.

For the out-of-domain experiments, we use a special release of the MultiUN corpus [31] as training data. It is a multilingual

Experimental results and discussion

In this paper, we compare our models with the most popular language models in toolkits such as IRSTLM [32] and KenLM [33]. The models in these toolkits are the language models highly recommended in the Moses framework. Heafield et al. contributed a scalable variant of the modified Kneser–Ney model [7] that does not rely on pruning, in which significant improvements were observed at the expense of much larger language models.
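
For context, a KenLM baseline can be queried through its Python bindings roughly as follows; the model file name and the example sentence are placeholders, not artifacts from the paper.

```python
# Usage sketch of the kenlm Python bindings (assuming `pip install kenlm`
# and a trained ARPA or binary model file; the path below is a placeholder).
import kenlm

model = kenlm.Model("wsj.5gram.arpa")
sentence = "the markets closed higher"
print(model.score(sentence, bos=True, eos=True))   # log10 probability
print(model.perplexity(sentence))                  # per-sentence perplexity
```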

The outcomes of our experiments together with the

Exploratory analysis of the models

Exploratory data analysis is an open-ended approach that aims at gaining a better understanding of a dataset. It is often carried out using suitable visual representations of the data, also called visualizations [23]. To this end, we first use three models (mKN, mKN-Interpolation, and TPmKN) to calculate the log-likelihood of every sentence in the test dataset. We then calculate the probability of every word in each sentence using the three models.
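
A minimal sketch of this exploratory step is shown below: every test sentence is scored under a baseline and a proposed model, and the per-sentence log-likelihoods are plotted against each other (the model callables and axis labels are assumptions for illustration).

```python
# Compare per-sentence log-likelihoods of two models with a scatter plot;
# points above the diagonal are sentences the proposed model scores higher.
import math
import matplotlib.pyplot as plt

def sentence_loglik(p, sent):
    """Sum of per-word log-probabilities under a model callable p(word, history)."""
    return sum(math.log(p(w, sent[:i])) for i, w in enumerate(sent))

def scatter_compare(p_baseline, p_proposed, test_sents):
    xs = [sentence_loglik(p_baseline, s) for s in test_sents]
    ys = [sentence_loglik(p_proposed, s) for s in test_sents]
    plt.scatter(xs, ys, s=5)
    lo, hi = min(xs + ys), max(xs + ys)
    plt.plot([lo, hi], [lo, hi])                 # diagonal: equal likelihood
    plt.xlabel("mKN log-likelihood per sentence")
    plt.ylabel("TPmKN log-likelihood per sentence")
    plt.show()
```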

Experimental results above clearly show that our new

Conclusion

In this paper, we introduced novel discounting language models using tunable discounting and polynomial discounting methods. The discount parameters are tuned on the validation set to adjust the models to the target set. The experiments empirically show that the tunable discounting language models perform as well as the modified KN model on in-domain data. In the out-of-domain scenarios, we observed significant improvements for our proposed models. Using the exploratory analysis with

Acknowledgments

We would like to express our gratitude to BELSPO (Belgian Science Policy). We also thank our colleagues at the ULB, especially Raphael Hubain, Mathias Coeckelbergs and Simon Hengchen, who provided insights that greatly assisted the research project.

References (33)

  • A. Mnih et al.

    Improving a statistical language model through non-linear prediction

    Neurocomputing

    (2009)
  • L. Rabiner et al.

    Fundamentals of Speech Recognition

    (1993)
  • C.D. Manning et al.

    Introduction to Information Retrieval

    (2008)
  • P. Koehn

    Statistical Machine Translation

    (2010)
  • J. Guo et al.

    A tunable language model for statistical machine translation

    Proceedings of the Eleventh Biennial Conference of the Association for Machine Translation in the Americas (AMTA)

    (2014)
  • S.F. Chen et al.

    An empirical study of smoothing techniques for language modeling

    Proc. ACL

    (1996)
  • A. Stolcke

    SRILM—an extensible language modeling toolkit

    Proc. INTERSPEECH

    (2002)
  • K. Heafield et al.

    Scalable modified Kneser-Ney language model estimation

    Proc. ACL

    (2013)
  • R. Kneser et al.

    Improved backing-off for m-gram language modeling

    Proc. ICASSP

    (1995)
  • F. Jelinek et al.

    Perplexity—a measure of the difficulty of speech recognition tasks

    J. Acoust. Soc. Am.

    (1977)
  • T. Brants et al.

    Large language models in machine translation

    Proc. EMNLP

    (2007)
  • H. Schütze

    Integrating history-length interpolation and classes in language modeling

    Proc. ACL

    (2011)
  • P.F. Brown et al.

    Class-based n-gram models of natural language

    Comput. Linguist.

    (1992)
  • R. Kneser et al.

    On the dynamic adaptation of stochastic language models

    Proc. ICASSP

    (1993)
  • Y. Bengio et al.

    Neural probabilistic language models

    Innovations in Machine Learning

    (2006)
  • T. Mikolov et al.

    Distributed representations of words and phrases and their compositionality

    Adv. Neural Inf. Process. Syst.

    (2013)

    Junfei Guo is a postdoctoral researcher at Université libre de Bruxelles and a researcher at Huazhong University of Science and Technology. He did his Ph.D. in a program of University of Stuttgart jointly with Wuhan University. His research interests include natural language processing, document management and machine learning.

    Qi Han received his Diplom degree in physics from the Brandenburg University of Technology Cottbus-Senftenberg, Germany. He is currently a Ph.D. candidate and works as a member of research staff at the Institute for Visualization and Interactive Systems (VIS) of the University of Stuttgart. His current research interests include visualization, visual analytics and machine learning. Specifically, he is interested in effective integration of techniques from visualization and natural language processing.

    Guangzhi Ma is an associate professor at Huazhong University of Science and Technology. His research interests concern about data mining and knowledge discovery. He is author of numerous papers both in journals and international conferences.

    Hong Liu is a professor at Huazhong University of Science and Technology. She completed her Ph.D. degree at Teesside University, UK. Her research interests include pattern recognition and computer vision.

    Seth van Hooland is an associate professor at Université libre de Bruxelles. He received his Ph.D. in Information and Communication Science at Université libre de Bruxelles. His research interests concern about document and records management and digital humanities.
