Various neural network architectures have been proposed in the literature to model 2D correlations in the input signal, including convolutional layers, frequency LSTMs, and 2D LSTMs such as time-frequency LSTMs, grid LSTMs, and ReNet LSTMs. It has been argued that frequency LSTMs can model translational variations similarly to CNNs, and that 2D LSTMs can model even more variations [1], but no proper comparison has been done for speech tasks. Although convolutional layers have been a popular technique in speech tasks, this paper compares convolutional and LSTM architectures for modeling time-frequency patterns as the first layer in an LDNN [2] architecture. This comparison is particularly interesting when the convolutional layer degrades performance, such as in noisy conditions or when the learned filterbank is not constant-Q [3]. We find that grid-LDNNs offer the best performance of all techniques, providing a 1–4% relative improvement over an LDNN and a CLDNN on three different large-vocabulary Voice Search tasks.
Cite as: Sainath, T.N., Li, B. (2016) Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks. Proc. Interspeech 2016, 813-817, doi: 10.21437/Interspeech.2016-84
@inproceedings{sainath16_interspeech,
  author={Tara N. Sainath and Bo Li},
  title={{Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks}},
  year=2016,
  booktitle={Proc. Interspeech 2016},
  pages={813--817},
  doi={10.21437/Interspeech.2016-84}
}