Stochasticity and Non-Autoregressive Modeling in Deep Generative Models of Text
Open access
Author
Date
2020-05
Type
- Doctoral Thesis
ETH Bibliography
yes
Abstract
Deep generative models of text have shown great success on a wide range of conditional and unconditional tasks. While the design of these models has converged to a few established architectures, their autoregressive training methodology has often been questioned. Treating generation as a sequential classification problem not only neglects the stochasticity involved when text is sampled, but also conflicts with continuous noise models when uncovering latent underlying factors. In this work, we question the standard training methodology from two perspectives: the stochasticity of the hidden states and the incorporation of autoregressive predictions.
First, we elevate the role of the hidden states by expressing their evolution through inherently stochastic transitions. Our non-autoregressive sequence model elegantly separates global and local uncertainty, yet requires only two ingredients: an independent noise source and a deterministic transition function. Variational training allows us to condition on observations through a deep inference model without any train-test discrepancy, and a combination of generative and inference flows provides insight into the roles of stochasticity in generation and approximate inference.
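As a rough illustration of this design (a sketch under stated assumptions, not the thesis's actual implementation), the following PyTorch snippet unrolls a state sequence from an independent Gaussian noise source and a deterministic transition network; all names and dimensions are placeholders.

    import torch
    import torch.nn as nn

    class StochasticStateModel(nn.Module):
        """Hypothetical non-autoregressive state chain: states evolve from
        independent noise through a deterministic transition, never from
        previously generated outputs."""
        def __init__(self, hidden_dim: int, noise_dim: int):
            super().__init__()
            self.noise_dim = noise_dim
            # Deterministic transition: (previous state, noise) -> next state.
            self.transition = nn.Sequential(
                nn.Linear(hidden_dim + noise_dim, hidden_dim),
                nn.Tanh(),
            )

        def forward(self, h0: torch.Tensor, seq_len: int) -> torch.Tensor:
            states, h = [], h0
            for _ in range(seq_len):
                # Independent noise source z_t ~ N(0, I).
                z = torch.randn(h.size(0), self.noise_dim, device=h.device)
                h = self.transition(torch.cat([h, z], dim=-1))
                states.append(h)
            return torch.stack(states, dim=1)  # (batch, seq_len, hidden_dim)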
Second, we develop an observation model that expresses autoregressive effects not by interfering with the hidden-state evolution but through global normalization. By leveraging the semantics captured in word embeddings, we can express local word correlations efficiently and obtain a strictly more powerful sequence model without losing the benefits of non-autoregressive states.
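A hedged sketch of what such an observation model could look like: per-position state-word compatibility is combined with an embedding-based bilinear term for adjacent-word correlations into a single sequence-level score. The bilinear form and all names here are assumptions; a complete model would additionally normalize this score globally over sequences, e.g. with dynamic programming as in a linear-chain CRF.

    import torch
    import torch.nn as nn

    class GloballyNormalizedObservation(nn.Module):
        def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
            self.emit = nn.Linear(hidden_dim, embed_dim)      # states -> embedding space
            self.pair = nn.Bilinear(embed_dim, embed_dim, 1)  # local word-correlation term

        def sequence_score(self, states, words):
            """Unnormalized log-score of a word sequence given hidden states.
            states: (batch, T, hidden_dim); words: (batch, T) token ids."""
            e = self.embed(words)                               # (batch, T, embed_dim)
            emission = (self.emit(states) * e).sum(-1).sum(-1)  # state-word compatibility
            pairwise = self.pair(e[:, :-1], e[:, 1:]).squeeze(-1).sum(-1)  # adjacent words
            return emission + pairwise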
Finally, we show how such word embeddings, static and contextual, can be exploited via reinforcement learning to guide the training of autoregressive models. Our first formulation casts entropy-regularized policy optimization as inference, with a reward that effectively corrects predictions made during training towards the ground truth. We empirically verify that deviating slightly from the ground truth improves generalization. Our second formulation expresses reward as a metric against a whole corpus, which allows training without any ground-truth information after a pre-training phase. We empirically verify that our reward indeed captures longer and semantically more challenging dependencies than a traditional n-gram reward.
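A minimal sketch of the first formulation under stated assumptions: a REINFORCE-style update with an entropy bonus, where the reward measures embedding similarity between the model's sampled word and the ground-truth word, so that semantically close deviations from the reference are rewarded rather than fully penalized. The cosine-similarity reward, the temperature tau, and all names below are illustrative, not the thesis's exact objective.

    import torch
    import torch.nn.functional as F

    def policy_step(logits, gold, embed, tau=0.01):
        """logits: (batch, vocab) model predictions; gold: (batch,) reference ids;
        embed: nn.Embedding holding (static or contextual) word vectors."""
        dist = torch.distributions.Categorical(logits=logits)
        sample = dist.sample()  # the model's own prediction
        # Reward: semantic closeness of the sampled word to the gold word.
        reward = F.cosine_similarity(embed(sample), embed(gold), dim=-1)
        # Entropy-regularized REINFORCE objective (maximized, hence negated).
        objective = reward.detach() * dist.log_prob(sample) + tau * dist.entropy()
        return -objective.mean()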
We compare our proposed models against autoregressive baselines on character-level and word-level generation, with a focus on keeping the training free of annealing strategies, auxiliary losses, or other forms of nudging.
Permanent link
https://doi.org/10.3929/ethz-b-000421025
Publication status
published
External links
Search print copy at ETH Library
Publisher
ETH Zurich
Subject
Natural Language Processing; Machine Learning; Deep Learning; Text Generation
Organisational unit
09462 - Hofmann, Thomas / Hofmann, Thomas