
A Deep Dive Into Neural Synchrony Evaluation for Audio-visual Translation

Published: 7 November 2022

ABSTRACT

We present a comprehensive analysis of the neural audio-visual synchrony evaluation tool SyncNet. We assess how well SyncNet scores agree with human perception and whether they can serve as a reliable metric for evaluating audio-visual lip-synchrony in generation tasks that have no ground-truth reference audio-video pair. Using interpretable explanations of SyncNet predictions, we further examine which elements of the audio and video critically affect synchrony, and we analyse the model's susceptibility to adversarial noise. SyncNet has been used in numerous papers on visually grounded text-to-speech for scenarios such as dubbing; we focus on this scenario, which features many local asynchronies, a condition SyncNet was not designed for.
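For readers unfamiliar with SyncNet-style scoring: the model embeds short windows of audio and of mouth-region video into a shared space and compares them across a range of temporal offsets; the offset with the minimum distance is taken as the audio-visual offset, and the gap between the median and the minimum distance is commonly reported as a synchrony confidence. The sketch below illustrates only this scoring step under simplifying assumptions; audio_feats and video_feats stand in for per-frame embeddings from the two network towers (hypothetical placeholders, not the authors' code), and the windowing conventions are simplified.

    # Minimal, illustrative sketch of SyncNet-style synchrony scoring.
    # audio_feats and video_feats are assumed to be per-frame embeddings
    # from the audio and visual towers of a SyncNet-like model.
    import numpy as np

    def sync_offset_and_confidence(audio_feats, video_feats, max_offset=15):
        """Slide the audio embeddings against the video embeddings, score each
        temporal offset by the mean Euclidean distance between aligned pairs,
        and return the best offset (in frames) together with a confidence value
        (median distance minus minimum distance)."""
        offsets = list(range(-max_offset, max_offset + 1))
        distances = []
        for off in offsets:
            # Align the two streams at this offset and keep only the overlap.
            if off >= 0:
                a = audio_feats[off:]
                v = video_feats[:len(video_feats) - off]
            else:
                a = audio_feats[:off]
                v = video_feats[-off:]
            n = min(len(a), len(v))
            if n == 0:
                distances.append(np.inf)
                continue
            distances.append(np.linalg.norm(a[:n] - v[:n], axis=1).mean())
        distances = np.array(distances)
        best = int(np.argmin(distances))
        confidence = float(np.median(distances[np.isfinite(distances)]) - distances[best])
        return offsets[best], confidence

    # Example usage with random embeddings (stand-ins for real model outputs).
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(100, 512))
    video = rng.normal(size=(100, 512))
    print(sync_offset_and_confidence(audio, video))

In practice, the per-offset distances are typically averaged over many short windows of a clip before the offset and confidence are computed, but the offset-search and median-minus-minimum confidence logic shown here is the core of how such scores are used as an evaluation metric.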


Published in

ICMI '22: Proceedings of the 2022 International Conference on Multimodal Interaction
November 2022, 830 pages
ISBN: 9781450393904
DOI: 10.1145/3536221

        Copyright © 2022 ACM


Publisher

Association for Computing Machinery, New York, NY, United States
