Stochasticity and Non-Autoregressive Modeling in Deep Generative Models of Text
Open access
Author
Date
2020-05
Type
- Doctoral Thesis
ETH Bibliography
yes
Abstract
Deep generative models of text have shown great success on a wide range of conditional and unconditional tasks. While the design of these models has converged to a few established architectures, their autoregressive training methodology has often been questioned. Treating generation as a sequential classification problem not only neglects the stochasticity involved when text is sampled, but also conflicts with continuous noise models when uncovering latent underlying factors. In this work, we question the standard training methodology from two perspectives: the stochasticity of the hidden states and the incorporation of autoregressive predictions.
First, we elevate the role of the hidden states by expressing their evolution through inherently stochastic transitions. Our non-autoregressive sequence model elegantly separates global and local uncertainty, yet requires only two ingredients: an independent noise source and a deterministic transition function. Variational training allows us to condition on observations through a deep inference model without any train-test discrepancy, and a combination of generative and inference flows provides insight into the roles of stochasticity in generation and approximate inference.
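As a rough illustration of this design (a sketch under stated assumptions, not the thesis's actual implementation), the following PyTorch snippet unrolls a state sequence from an independent Gaussian noise source and a deterministic transition network; all names and dimensions are placeholders.

    import torch
    import torch.nn as nn

    class StochasticStateModel(nn.Module):
        """Hypothetical non-autoregressive state chain: states evolve from
        independent noise through a deterministic transition, never from
        previously generated outputs."""
        def __init__(self, hidden_dim: int, noise_dim: int):
            super().__init__()
            self.noise_dim = noise_dim
            # Deterministic transition: (previous state, noise) -> next state.
            self.transition = nn.Sequential(
                nn.Linear(hidden_dim + noise_dim, hidden_dim),
                nn.Tanh(),
            )

        def forward(self, h0: torch.Tensor, seq_len: int) -> torch.Tensor:
            states, h = [], h0
            for _ in range(seq_len):
                # Independent noise source z_t ~ N(0, I).
                z = torch.randn(h.size(0), self.noise_dim, device=h.device)
                h = self.transition(torch.cat([h, z], dim=-1))
                states.append(h)
            return torch.stack(states, dim=1)  # (batch, seq_len, hidden_dim)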
Second, we develop an observation model that expresses autoregressive effects not by interfering with the hidden-state evolution but through global normalization. By leveraging the semantics captured in word embeddings, we can express local word correlations efficiently and obtain a strictly more powerful sequence model without losing the benefits of non-autoregressive states.
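A hedged sketch of what such an observation model could look like: per-position state-word compatibility is combined with an embedding-based bilinear term for adjacent-word correlations into a single sequence-level score. The bilinear form and all names here are assumptions; a complete model would additionally normalize this score globally over sequences, e.g. with dynamic programming as in a linear-chain CRF.

    import torch
    import torch.nn as nn

    class GloballyNormalizedObservation(nn.Module):
        def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
            self.emit = nn.Linear(hidden_dim, embed_dim)      # states -> embedding space
            self.pair = nn.Bilinear(embed_dim, embed_dim, 1)  # local word-correlation term

        def sequence_score(self, states, words):
            """Unnormalized log-score of a word sequence given hidden states.
            states: (batch, T, hidden_dim); words: (batch, T) token ids."""
            e = self.embed(words)                               # (batch, T, embed_dim)
            emission = (self.emit(states) * e).sum(-1).sum(-1)  # state-word compatibility
            pairwise = self.pair(e[:, :-1], e[:, 1:]).squeeze(-1).sum(-1)  # adjacent words
            return emission + pairwise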
Finally, we show how such word embeddings, static and contextual, can be exploited via reinforcement learning to guide the training of autoregressive models. Our first formulation casts entropy-regularized policy optimization as inference, with a reward that effectively corrects predictions made during training towards the ground truth. We empirically verify that deviating slightly from the ground truth improves generalization. Our second formulation expresses reward as a metric against a whole corpus, which allows training without any ground-truth information after a pre-training phase. We empirically verify that our reward indeed captures longer and semantically more challenging dependencies than a traditional n-gram reward.
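A minimal sketch of the first formulation under stated assumptions: a REINFORCE-style update with an entropy bonus, where the reward measures embedding similarity between the model's sampled word and the ground-truth word, so that semantically close deviations from the reference are rewarded rather than fully penalized. The cosine-similarity reward, the temperature tau, and all names below are illustrative, not the thesis's exact objective.

    import torch
    import torch.nn.functional as F

    def policy_step(logits, gold, embed, tau=0.01):
        """logits: (batch, vocab) model predictions; gold: (batch,) reference ids;
        embed: nn.Embedding holding (static or contextual) word vectors."""
        dist = torch.distributions.Categorical(logits=logits)
        sample = dist.sample()  # the model's own prediction
        # Reward: semantic closeness of the sampled word to the gold word.
        reward = F.cosine_similarity(embed(sample), embed(gold), dim=-1)
        # Entropy-regularized REINFORCE objective (maximized, hence negated).
        objective = reward.detach() * dist.log_prob(sample) + tau * dist.entropy()
        return -objective.mean()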
We compare our proposed models against autoregressive baselines on character-level and word-level generation, with a focus on keeping the training free of annealing strategies, auxiliary losses, or other forms of nudging.
Permanent link
https://doi.org/10.3929/ethz-b-000421025
Publication status
published
External links
Search print copy at ETH Library
Publisher
ETH Zurich
Subject
Natural Language Processing; Machine Learning; Deep Learning; Text Generation
Organisational unit
09462 - Hofmann, Thomas / Hofmann, Thomas