DOI: 10.1145/3394171.3413740
Research Article

FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire

Published: 12 October 2020

Abstract

Lipreading is an impressive technique whose accuracy has improved markedly in recent years. However, existing lipreading methods mainly build on autoregressive (AR) models, which generate target tokens one by one and therefore suffer from high inference latency. To break through this constraint, we propose FastLR, a non-autoregressive (NAR) lipreading model that generates all target tokens simultaneously. NAR lipreading is a challenging task with several difficulties: 1) the discrepancy between source and target sequence lengths makes it hard to estimate the length of the output sequence; 2) the conditionally independent behavior of NAR generation lacks correlation across time, which leads to a poor approximation of the target distribution; 3) the feature representation ability of the encoder can be weak due to the lack of an effective alignment mechanism; and 4) the removal of the AR language model exacerbates the inherent ambiguity of lipreading. Thus, in this paper, we introduce three methods to close the gap between FastLR and AR models: 1) to address challenges 1 and 2, we leverage an integrate-and-fire (I&F) module to model the correspondence between source video frames and the output text sequence; 2) to tackle challenge 3, we add an auxiliary connectionist temporal classification (CTC) decoder on top of the encoder and optimize it with an extra CTC loss, and we also add an auxiliary autoregressive decoder to help the encoder's feature extraction; 3) to overcome challenge 4, we propose a novel Noisy Parallel Decoding (NPD) scheme for I&F and bring Byte-Pair Encoding (BPE) into lipreading. Our experiments show that FastLR achieves a speedup of up to 10.97× compared with the state-of-the-art lipreading model, with slight absolute WER increases of 1.5% and 5.5% on the GRID and LRS2 lipreading datasets, respectively, which demonstrates the effectiveness of the proposed method.
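The integrate-and-fire idea behind FastLR's length estimation can be stated compactly: each encoded video frame contributes a learned non-negative weight; the weights are accumulated over time, and one output representation "fires" whenever the running sum crosses a threshold, with the boundary frame's weight split between adjacent outputs, so the number of fired vectors directly determines the target length. As a rough illustration only (a simplified NumPy sketch in the spirit of continuous integrate-and-fire, not the authors' implementation; the function name, the supplied weights alphas, and the threshold value are all assumptions made for this example):

    import numpy as np

    def integrate_and_fire(encoder_states, alphas, threshold=1.0):
        # encoder_states: (T, d) per-frame encoder outputs.
        # alphas: (T,) non-negative weights; in the model these would be
        # predicted per frame, here they are supplied for illustration.
        # Simplification: assumes each alpha < threshold, so at most one
        # firing can happen per frame.
        outputs = []
        acc = 0.0                                   # accumulated weight
        state = np.zeros(encoder_states.shape[1])   # weighted-sum carry
        for h, a in zip(encoder_states, alphas):
            if acc + a < threshold:                 # keep integrating
                acc += a
                state = state + a * h
            else:                                   # fire one output vector
                remainder = threshold - acc         # portion closing this token
                outputs.append(state + remainder * h)
                acc = a - remainder                 # leftover opens the next token
                state = acc * h
        # trailing weight below the threshold is dropped in this sketch
        return (np.stack(outputs) if outputs
                else np.empty((0, encoder_states.shape[1])))

    # toy usage: 6 frames of 4-dim features whose weights sum to ~2 tokens
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(6, 4))
    weights = np.array([0.3, 0.4, 0.5, 0.2, 0.4, 0.3])
    print(integrate_and_fire(frames, weights).shape)   # -> (2, 4)

In CIF-style training the fired vectors become the parallel decoder's inputs, and a quantity loss pushes the sum of the per-frame weights toward the reference token count, which is what lets a non-autoregressive model estimate the output length without step-by-step decoding.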

Supplementary Material

MP4 File (3394171.3413740.mp4)
A brief video introduction to the paper "FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire".

      Published In

      MM '20: Proceedings of the 28th ACM International Conference on Multimedia
      October 2020
      4889 pages
      ISBN:9781450379885
      DOI:10.1145/3394171

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. deep learning
      2. lip reading
      3. non-autoregressive generation

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China
      • National Key R&D Program of China
      • Zhejiang Natural Science Foundation
      • Fundamental Research Funds for the Central Universities

      Conference

      MM '20

      Acceptance Rates

      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

      • Downloads (Last 12 months)23
      • Downloads (Last 6 weeks)3
      Reflects downloads up to 28 Feb 2025

Cited By

• (2024) A novel framework using 3D-CNN and BiLSTM model with dynamic learning rate scheduler for visual speech recognition. Signal, Image and Video Processing 18:6-7, 5433-5448. https://doi.org/10.1007/s11760-024-03245-7. Online publication date: 18-May-2024.
• (2023) Research on a Lip Reading Algorithm Based on Efficient-GhostNet. Electronics 12:5, 1151. https://doi.org/10.3390/electronics12051151. Online publication date: 27-Feb-2023.
• (2023) LipFormer: Learning to Lipread Unseen Speakers Based on Visual-Landmark Transformers. IEEE Transactions on Circuits and Systems for Video Technology 33:9, 4507-4517. https://doi.org/10.1109/TCSVT.2023.3282224. Online publication date: Sep-2023.
• (2023) MALip: Modal Amplification Lipreading based on reconstructed audio features. Signal Processing: Image Communication 117, 117002. https://doi.org/10.1016/j.image.2023.117002. Online publication date: Sep-2023.
• (2022) Review on research progress of machine lip reading. The Visual Computer 39:7, 3041-3057. https://doi.org/10.1007/s00371-022-02511-4. Online publication date: 15-Jun-2022.
• (2021) Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition. Sensors 22:1, 72. https://doi.org/10.3390/s22010072. Online publication date: 23-Dec-2021.
• (2021) SimulSLT. Proceedings of the 29th ACM International Conference on Multimedia, 4118-4127. https://doi.org/10.1145/3474085.3475544. Online publication date: 17-Oct-2021.
• (2021) Towards Fast and High-Quality Sign Language Production. Proceedings of the 29th ACM International Conference on Multimedia, 3172-3181. https://doi.org/10.1145/3474085.3475463. Online publication date: 17-Oct-2021.
• (2021) SimulLR: Simultaneous Lip Reading Transducer with Attention-Guided Adaptive Memory. Proceedings of the 29th ACM International Conference on Multimedia, 1359-1367. https://doi.org/10.1145/3474085.3475220. Online publication date: 17-Oct-2021.
