
Fine-grained sequence-to-sequence lip reading based on self-attention and self-distillation

  • Letter
Frontiers of Computer Science

Conclusion

In this paper, we proposed a seq2seq model based on self-attention and self-distillation for sentence-level lip reading. The model comprises four parts: a CNN front-end, pixel-wise learning, temporal learning, and a decoder. The CNN front-end captures shallow spatial features from the image sequence, and the Resformer module models the deep spatial correlation between pixels within each frame (pixel-wise learning). The encoder then learns temporal features across frames (temporal learning). Finally, the decoder decodes the visual information into predicted text. In addition, self-distillation is applied to further improve the model. In experiments on GRID, LRW, and LRW-1000, the proposed model achieves competitive results on the WER, CER, and Acc metrics. However, the model's complexity remains a limitation, which needs to be addressed in future work.
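For readers unfamiliar with the WER and CER metrics mentioned above, the following is a minimal sketch of how word and character error rates are commonly computed from the Levenshtein edit distance. It is not the authors' evaluation code; the function names (`edit_distance`, `error_rate`, `wer`, `cer`) and the sample sentences are assumptions made for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalised by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: tokens are single characters (spaces included)."""
    return error_rate(list(reference), list(hypothesis))


if __name__ == "__main__":
    # Hypothetical reference/prediction pair, not taken from the paper.
    ref = "bin blue at f two now"
    hyp = "bin blue in f two now"
    print(f"WER = {wer(ref, hyp):.3f}, CER = {cer(ref, hyp):.3f}")
```

Lower values are better for both rates. Because insertions count as errors, either rate can exceed 1.0 when the prediction is much longer than the reference.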




Author information


Corresponding author

Correspondence to Shibo Huang.


Cite this article

Xue, J., Huang, S., Song, H. et al. Fine-grained sequence-to-sequence lip reading based on self-attention and self-distillation. Front. Comput. Sci. 17, 176344 (2023). https://doi.org/10.1007/s11704-023-2230-x


  • DOI: https://doi.org/10.1007/s11704-023-2230-x
