
Audio-guided self-supervised learning for disentangled visual speech representations

  • Letter
  • Frontiers of Computer Science

4 Conclusion

In this paper, we propose a novel two-branch framework for learning disentangled visual speech representations, motivated by two particular observations. Its main idea is to introduce the audio signal to guide the learning of speech-relevant cues, while a bottleneck restricts the speech-irrelevant branch from capturing high-frequency, fine-grained speech cues. Experiments on the word-level dataset LRW and the sentence-level dataset LRS2-BBC demonstrate the effectiveness of the proposed framework. In future work, we plan to explore more explicit auxiliary tasks and constraints, beyond the reconstruction task shared by the speech-relevant and speech-irrelevant branches, to further improve the framework's ability to capture speech cues in video. We also leave for future work the combination of multiple types of knowledge representations [10] to further strengthen the learned speech representations.
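To make the idea above concrete, the following is a minimal PyTorch-style sketch of a two-branch disentanglement setup: an audio-guided speech-relevant branch, a bottlenecked speech-irrelevant branch, and a shared reconstruction objective. All module choices, dimensions, and loss forms here are illustrative assumptions, not the authors' implementation.

    # A minimal, illustrative sketch of the two-branch disentanglement idea.
    # Module names, dimensions, and loss forms are assumptions for illustration only.
    import torch
    import torch.nn as nn

    class TwoBranchDisentangler(nn.Module):
        def __init__(self, feat_dim=512, bottleneck_dim=32):
            super().__init__()
            # Speech-relevant branch: high capacity, guided by the audio signal.
            self.relevant_enc = nn.GRU(feat_dim, feat_dim, batch_first=True)
            # Speech-irrelevant branch: a narrow bottleneck restricts it from
            # capturing fine-grained, high-frequency speech cues.
            self.irrelevant_enc = nn.Sequential(
                nn.Linear(feat_dim, bottleneck_dim), nn.ReLU(),
                nn.Linear(bottleneck_dim, feat_dim),
            )
            # Decoder reconstructs the visual features from both branches together.
            self.decoder = nn.Linear(2 * feat_dim, feat_dim)

        def forward(self, visual_feats, audio_feats):
            # visual_feats, audio_feats: (batch, time, feat_dim), time-aligned
            speech_repr, _ = self.relevant_enc(visual_feats)            # speech-relevant cues
            other_repr = self.irrelevant_enc(visual_feats.mean(dim=1))  # per-clip, low-rate cues
            other_repr = other_repr.unsqueeze(1).expand_as(speech_repr)
            recon = self.decoder(torch.cat([speech_repr, other_repr], dim=-1))

            # Audio guidance: pull the speech-relevant branch toward the audio features.
            guide_loss = nn.functional.mse_loss(speech_repr, audio_feats)
            # Reconstruction: both branches jointly must explain the visual input.
            recon_loss = nn.functional.mse_loss(recon, visual_feats)
            return guide_loss + recon_loss

    # Usage with random tensors standing in for real video/audio features.
    model = TwoBranchDisentangler()
    v = torch.randn(2, 25, 512)   # e.g., 25 video frames of 512-d visual features
    a = torch.randn(2, 25, 512)   # time-aligned 512-d audio features
    loss = model(v, a)
    loss.backward()

The narrow bottleneck and the per-clip pooling are the two levers that keep the speech-irrelevant branch from encoding fine-grained, high-frequency speech cues, so the reconstruction must draw those cues from the audio-guided branch.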


References

  1. Shi B, Hsu W N, Lakhotia K, Mohamed A. Learning audio-visual speech representation by masked multimodal cluster prediction. In: Proceedings of the 10th International Conference on Learning Representations. 2022

  2. Hsu W N, Shi B. u-HuBERT: unified mixed-modal speech pretraining and zero-shot transfer to unlabeled modality. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 1538

  3. Stafylakis T, Tzimiropoulos G. Combining residual networks with LSTMs for lipreading. In: Proceedings of the 18th Annual Conference of the International Speech Communication Association. 2017, 3652–3656

  4. Ma P, Martinez B, Petridis S, Pantic M. Towards practical lipreading with distilled and efficient models. In: Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. 2021, 7608–7612

  5. Ma P, Wang Y, Shen J, Petridis S, Pantic M. Lip-reading with densely connected temporal convolutional networks. In: Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision. 2021, 2856–2865

  6. Koumparoulis A, Potamianos G. Accurate and resource-efficient lipreading with EfficientNetV2 and transformers. In: Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. 2022, 8467–8471

  7. Ma P, Petridis S, Pantic M. End-to-end audio-visual speech recognition with conformers. In: Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. 2021, 7613–7617

  8. Ma P, Petridis S, Pantic M. Visual speech recognition for multiple languages in the wild. Nature Machine Intelligence, 2022, 4(11): 930–939


  9. Ma P, Haliassos A, Fernandez-Lopez A, Chen H, Petridis S, Pantic M. Auto-AVSR: audio-visual speech recognition with automatic labels. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 2023, 1–5

  10. Yang Y, Zhuang Y, Pan Y. Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies. Frontiers of Information Technology & Electronic Engineering, 2021, 22(12): 1551–1558



Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (Grant Nos. 62276247, 62076250). We thank Bingquan Xia for his help with the experiments and Yuanhang Zhang for proofreading.

Author information


Corresponding author

Correspondence to Shuang Yang.

Ethics declarations

Competing interests The authors declare that they have no competing interests or financial conflicts to disclose.

Additional information

Electronic supplementary material Supplementary material is available in the online version of this article at journal.hep.com.cn and link.springer.com.



Cite this article

Feng, D., Yang, S., Shan, S. et al. Audio-guided self-supervised learning for disentangled visual speech representations. Front. Comput. Sci. 18, 186353 (2024). https://doi.org/10.1007/s11704-024-3787-8
