A Deep Dive Into Neural Synchrony Evaluation for Audio-visual Translation

ABSTRACT
We present a comprehensive analysis of SyncNet, a neural tool for evaluating audio-visual synchrony. We assess how well SyncNet scores agree with human perception, and whether they can serve as a reliable metric for evaluating audio-visual lip synchrony in generation tasks where no ground-truth reference audio-video pair exists. Using interpretable explanations of SyncNet predictions, we further investigate which elements of the audio and video critically affect synchrony, and we probe the model's susceptibility to adversarial noise. SyncNet has been used in numerous papers on visually grounded text-to-speech for scenarios such as dubbing; we focus on this scenario, which features many local asynchronies, a condition SyncNet was not designed for.
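To make the evaluation setting concrete, the following is a minimal sketch of SyncNet-style offset and confidence scoring, in the spirit of Chung and Zisserman's "Out of Time" procedure. It assumes precomputed, L2-normalised per-window audio and video embeddings (e.g. from the two pretrained SyncNet towers); the function and variable names are illustrative, not the actual SyncNet API.

```python
import numpy as np

def sync_offset_and_confidence(video_emb, audio_emb, max_shift=15):
    """video_emb, audio_emb: (T, D) arrays of per-window embeddings.

    Slide the audio stream against the video stream over a range of
    frame offsets; the offset with the smallest mean embedding distance
    is the predicted sync offset, and the confidence is the margin
    between the median and the minimum of the distance curve.
    Assumes clips longer than max_shift windows.
    """
    dists = []
    for shift in range(-max_shift, max_shift + 1):
        # Align the two streams under this hypothesised offset.
        if shift >= 0:
            v, a = video_emb[shift:], audio_emb[:len(audio_emb) - shift]
        else:
            v, a = video_emb[:shift], audio_emb[-shift:]
        n = min(len(v), len(a))
        d = np.linalg.norm(v[:n] - a[:n], axis=1).mean()
        dists.append(d)
    dists = np.asarray(dists)
    offset = int(np.argmin(dists)) - max_shift
    confidence = float(np.median(dists) - dists.min())
    return offset, confidence
```

The confidence (median distance minus minimum distance) is the reference-free score commonly reported in dubbing and talking-face papers: a flat distance curve yields low confidence even when the best offset happens to be zero. For the adversarial analysis, a standard FGSM-style perturbation can be sketched as below, assuming a differentiable PyTorch scorer `syncnet(audio, video)` that returns a scalar synchrony confidence; again, the names are hypothetical rather than the actual SyncNet interface.

```python
import torch

def fgsm_audio(syncnet, audio, video, epsilon=1e-3):
    """One FGSM step on the audio input to reduce predicted synchrony."""
    audio = audio.clone().detach().requires_grad_(True)
    score = syncnet(audio, video)   # scalar: higher = more in sync
    score.backward()                # gradient of confidence w.r.t. audio
    # Step *against* the gradient to push the score down.
    return (audio - epsilon * audio.grad.sign()).detach()
```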