A Deep Dive Into Neural Synchrony Evaluation for Audio-visual Translation

ABSTRACT
We present a comprehensive analysis of SyncNet, a neural tool for evaluating audio-visual synchrony. We assess how well SyncNet scores agree with human perception, and whether they can serve as a reliable metric for evaluating audio-visual lip synchrony in generation tasks where no ground-truth reference audio-video pair exists. Using interpretable explanations of SyncNet predictions, we further investigate which elements of the audio and video critically affect synchrony, and we probe the model's susceptibility to adversarial noise. SyncNet has been used in numerous papers on visually grounded text-to-speech for scenarios such as dubbing; we focus on this scenario, which features many local asynchronies, a condition SyncNet was not designed for.
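To make the evaluation setting concrete, the following is a minimal sketch of SyncNet-style offset and confidence scoring, in the spirit of Chung and Zisserman's "Out of Time" procedure. It assumes precomputed, L2-normalised per-window audio and video embeddings (e.g. from the two pretrained SyncNet towers); the function and variable names are illustrative, not the actual SyncNet API.

```python
import numpy as np

def sync_offset_and_confidence(video_emb, audio_emb, max_shift=15):
    """video_emb, audio_emb: (T, D) arrays of per-window embeddings.

    Slide the audio stream against the video stream over a range of
    frame offsets; the offset with the smallest mean embedding distance
    is the predicted sync offset, and the confidence is the margin
    between the median and the minimum of the distance curve.
    Assumes clips longer than max_shift windows.
    """
    dists = []
    for shift in range(-max_shift, max_shift + 1):
        # Align the two streams under this hypothesised offset.
        if shift >= 0:
            v, a = video_emb[shift:], audio_emb[:len(audio_emb) - shift]
        else:
            v, a = video_emb[:shift], audio_emb[-shift:]
        n = min(len(v), len(a))
        d = np.linalg.norm(v[:n] - a[:n], axis=1).mean()
        dists.append(d)
    dists = np.asarray(dists)
    offset = int(np.argmin(dists)) - max_shift
    confidence = float(np.median(dists) - dists.min())
    return offset, confidence
```

The confidence (median distance minus minimum distance) is the reference-free score commonly reported in dubbing and talking-face papers: a flat distance curve yields low confidence even when the best offset happens to be zero. For the adversarial analysis, a standard FGSM-style perturbation can be sketched as below, assuming a differentiable PyTorch scorer `syncnet(audio, video)` that returns a scalar synchrony confidence; again, the names are hypothetical rather than the actual SyncNet interface.

```python
import torch

def fgsm_audio(syncnet, audio, video, epsilon=1e-3):
    """One FGSM step on the audio input to reduce predicted synchrony."""
    audio = audio.clone().detach().requires_grad_(True)
    score = syncnet(audio, video)   # scalar: higher = more in sync
    score.backward()                # gradient of confidence w.r.t. audio
    # Step *against* the gradient to push the score down.
    return (audio - epsilon * audio.grad.sign()).detach()
```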