Conclusion
In this paper, we proposed a seq2seq model based on self-attention and self-distillation for sentence-level lip reading. The model comprises four parts: a CNN front-end, pixel-wise learning, temporal learning, and a decoder. The CNN front-end captures shallow spatial features from the image sequence, and the Resformer module models the deep spatial correlation between pixels within each frame (pixel-wise learning). The encoder then learns temporal features (temporal learning), and the decoder converts the resulting visual information into predicted text. In addition, self-distillation is applied to further improve performance. In experiments on GRID, LRW, and LRW-1000, the proposed model achieves competitive results in terms of WER, CER, and accuracy (Acc). However, the complexity of the model remains a limitation, which needs to be addressed in future work.
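For readers unfamiliar with the reported metrics, the following is a minimal sketch of how WER and CER are commonly computed, assuming the standard definition (Levenshtein edit distance divided by the reference length). The function names `edit_distance`, `wer`, and `cer` are illustrative and are not taken from our implementation; the example sentence only imitates the style of GRID commands.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ref = "bin blue at f two now"   # hypothetical ground-truth sentence
    hyp = "bin blue at f too now"   # hypothetical prediction with one wrong word
    print(f"WER = {wer(ref, hyp):.3f}, CER = {cer(ref, hyp):.3f}")
```

Under this definition, one wrong word in a six-word sentence gives a WER of 1/6, while the CER depends on how many characters of that word differ. Spaces count as characters in this sketch, which is an assumption and may differ from a given evaluation protocol.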
Cite this article
Xue, J., Huang, S., Song, H. et al. Fine-grained sequence-to-sequence lip reading based on self-attention and self-distillation. Front. Comput. Sci. 17, 176344 (2023). https://doi.org/10.1007/s11704-023-2230-x
DOI: https://doi.org/10.1007/s11704-023-2230-x