Sequential data such as speech and dialog are usually modeled with Recurrent
Neural Networks (RNNs) and their derivatives, since such architectures allow
information to propagate through time. However, RNNs also have drawbacks,
including limited network depth and a training process that is poorly
suited to GPU parallelization.
Estimating the timing
of turn-taking is a critical feature of dialog systems. Such tasks
require knowledge about past dialog contexts and have been modeled
using RNNs in several studies. In this paper, we propose a non-RNN
model for the timing estimation of turn-taking in dialogs. The proposed
model takes lexical and acoustic features as its input to predict a
turn’s end. We conducted experiments on four types of Japanese
conversation datasets and show that, with proper neural network design,
long-term information in a dialog can propagate without a recurrent
structure. The proposed model outperformed canonical RNN-based architectures
on a turn-taking estimation task.
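The abstract does not specify the model's internals. As a purely illustrative sketch of the general idea (a non-recurrent, frame-level turn-end predictor over concatenated lexical and acoustic features), the following uses a causal 1-D convolution so that each prediction depends only on past context; the layer sizes, parameter names, and choice of convolution are assumptions, not the authors' architecture:

```python
import numpy as np

def causal_conv1d(x, w, b):
    """Causal 1-D convolution: the output at time t sees only frames <= t.
    x: (T, D_in), w: (K, D_in, D_out), b: (D_out,). Hypothetical layer."""
    K, T = w.shape[0], x.shape[0]
    xp = np.vstack([np.zeros((K - 1, x.shape[1])), x])  # left-pad for causality
    out = np.stack([np.einsum('kd,kdo->o', xp[t:t + K], w) + b
                    for t in range(T)])
    return np.maximum(out, 0.0)  # ReLU

def turn_end_probs(lexical, acoustic, params):
    """Per-frame probability that the current turn ends (illustrative only)."""
    x = np.concatenate([lexical, acoustic], axis=1)  # (T, D_lex + D_ac)
    h = causal_conv1d(x, params['w1'], params['b1'])
    logits = h @ params['w2'] + params['b2']         # (T, 1)
    return 1.0 / (1.0 + np.exp(-logits))             # sigmoid
```

Because the convolution is causal, long-range context is obtained by stacking such layers (widening the receptive field) rather than by recurrence, which is the property the abstract argues makes a recurrent structure unnecessary.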
Cite as: Liu, C., Ishi, C., Ishiguro, H. (2019) A Neural Turn-Taking Model without RNN. Proc. Interspeech 2019, 4150-4154, doi: 10.21437/Interspeech.2019-2270
@inproceedings{liu19l_interspeech,
  author    = {Chaoran Liu and Carlos Ishi and Hiroshi Ishiguro},
  title     = {{A Neural Turn-Taking Model without RNN}},
  booktitle = {Proc. Interspeech 2019},
  year      = {2019},
  pages     = {4150--4154},
  doi       = {10.21437/Interspeech.2019-2270}
}