ISCA Archive Interspeech 2021

Automatic Lip-Reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries

Hang Chen, Jun Du, Yu Hu, Li-Rong Dai, Bao-Cai Yin, Chin-Hui Lee

In this paper, we propose a novel deep learning architecture for improving word-level lip-reading. We first incorporate multi-scale processing into spatial feature extraction for lip-reading using hierarchical pyramidal convolution (HPConv). Specifically, HPConv is proposed to replace the conventional convolution, improving the model’s ability to discover fine-grained lip movements. Next, to deal with fixed-length image sequences representing words in a given database, a self-attention mechanism is proposed to integrate local information across all lip frames without assuming known word boundaries, so that our deep models automatically utilize key features in the relevant frames of a given word. Experiments on the Lip Reading in the Wild corpus show that our proposed architecture achieves an accuracy of 86.83%, yielding a relative error rate reduction of about 10% from that obtained with a state-of-the-art scheme that averages frame scores for information fusion. A detailed analysis of the experimental results also confirms that the weights learned by self-attention tend to be zero at both ends of an image sequence and concentrate the non-zero weights in the middle part, where the given word is located.
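
The contrast between averaging frame scores and attention-based fusion can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy per-frame scores, and the hand-set `relevance` values are all hypothetical, and in the paper the weights are learned by a self-attention mechanism rather than given. The sketch only shows how near-zero weights at both ends of a sequence suppress frames that lie outside the target word.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def average_fusion(frame_scores):
    """Baseline fusion: unweighted mean of per-frame class scores."""
    n_frames = len(frame_scores)
    return [sum(column) / n_frames for column in zip(*frame_scores)]


def attention_fusion(frame_scores, relevance):
    """Weighted fusion: softmax-normalized weights, one per frame.

    `relevance` stands in for the output of a learned attention module
    (an assumption for illustration; here it is set by hand).
    """
    weights = softmax(relevance)
    fused = [
        sum(w * s for w, s in zip(weights, column))
        for column in zip(*frame_scores)
    ]
    return fused, weights


# Toy sequence: 5 frames, 2 classes. The first and last frames belong to
# neighbouring words and vote for class 0; the middle frames vote for class 1.
frame_scores = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.9, 0.1],
]
# Hypothetical relevance values: low at both ends, high in the middle.
relevance = [-4.0, 2.0, 3.0, 2.0, -4.0]

mean_scores = average_fusion(frame_scores)
attn_scores, attn_weights = attention_fusion(frame_scores, relevance)

print("average fusion:  ", [round(s, 3) for s in mean_scores])
print("attention fusion:", [round(s, 3) for s in attn_scores])
print("frame weights:   ", [round(w, 3) for w in attn_weights])
```

With these toy values, the boundary frames receive almost no weight, so the fused score depends mainly on the middle frames, whereas plain averaging lets every frame contribute equally.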


doi: 10.21437/Interspeech.2021-723

Cite as: Chen, H., Du, J., Hu, Y., Dai, L.-R., Yin, B.-C., Lee, C.-H. (2021) Automatic Lip-Reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries. Proc. Interspeech 2021, 3001-3005, doi: 10.21437/Interspeech.2021-723

@inproceedings{chen21k_interspeech,
  author={Hang Chen and Jun Du and Yu Hu and Li-Rong Dai and Bao-Cai Yin and Chin-Hui Lee},
  title={{Automatic Lip-Reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3001--3005},
  doi={10.21437/Interspeech.2021-723}
}