
Character-Aware Audio-Visual Subtitling in Context

  • Conference paper
  • First Online:
Computer Vision – ACCV 2024 (ACCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15474)


Abstract

This paper presents an improved framework for character-aware audio-visual subtitling in TV shows. Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues. This holistic solution addresses what is said, when it is said, and who is speaking, providing more comprehensive and accurate character-aware subtitles. Our approach brings improvements on two fronts: first, we show that audio-visual synchronisation can be used to pick out the talking face amongst the others present in a video clip and to assign an identity to the corresponding speech segment. This audio-visual approach improves recognition accuracy and yield over current methods. Second, we show that the speaker of short segments can be determined by using the temporal context of the dialogue within a scene. We propose an approach using local voice embeddings of the audio, and large language model reasoning on the text transcription. This overcomes a limitation of existing methods, which are unable to accurately assign speakers to short temporal segments. We validate the method on a dataset of 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches. Project page: https://www.robots.ox.ac.uk/~vgg/research/llr-context/.
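
The abstract describes two components: selecting the talking face via audio-visual synchronisation, and resolving short speech segments from local scene context. The sketch below is only a rough illustration of how such a pipeline might be wired together, not the authors' implementation; all names, data structures, and thresholds (e.g. `SYNC_THRESHOLD`, `MIN_DURATION`, the precomputed sync scores and voice embeddings) are assumptions made for the example.

```python
# Illustrative sketch (assumptions only): character assignment for speech segments
# using (1) audio-visual sync to pick the talking face and (2) local voice-embedding
# context to label short segments. Not the authors' code; all names are hypothetical.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Segment:
    start: float
    end: float
    voice_emb: np.ndarray                              # speaker embedding of the audio
    face_sync: dict = field(default_factory=dict)      # face-track id -> AV-sync score
    face_identity: dict = field(default_factory=dict)  # face-track id -> character name
    speaker: str | None = None

SYNC_THRESHOLD = 0.5   # assumed: minimum sync score to trust the talking face
MIN_DURATION = 1.0     # assumed: segments shorter than this fall back to context

def assign_by_av_sync(seg: Segment) -> str | None:
    """Pick the face track best synchronised with the audio and return its identity."""
    if not seg.face_sync:
        return None
    track, score = max(seg.face_sync.items(), key=lambda kv: kv[1])
    if score < SYNC_THRESHOLD:
        return None
    return seg.face_identity.get(track)

def assign_short_by_context(seg: Segment, labelled: list[Segment]) -> str | None:
    """Label a short segment by its nearest labelled voice embedding within the scene."""
    best_name, best_sim = None, -1.0
    for other in labelled:
        sim = float(np.dot(seg.voice_emb, other.voice_emb) /
                    (np.linalg.norm(seg.voice_emb) * np.linalg.norm(other.voice_emb) + 1e-8))
        if sim > best_sim:
            best_name, best_sim = other.speaker, sim
    return best_name

def label_scene(segments: list[Segment]) -> list[Segment]:
    # Pass 1: confident assignments from audio-visual synchronisation.
    for seg in segments:
        if seg.end - seg.start >= MIN_DURATION:
            seg.speaker = assign_by_av_sync(seg)
    labelled = [s for s in segments if s.speaker is not None]
    # Pass 2: short or unresolved segments borrow identities from scene context.
    for seg in segments:
        if seg.speaker is None and labelled:
            seg.speaker = assign_short_by_context(seg, labelled)
    return segments
```

In the paper itself the context step also involves large language model reasoning over the dialogue transcript; the cosine-similarity fallback above merely stands in for that idea.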


Notes

  1. https://www.robots.ox.ac.uk/~vgg/research/look-listen-recognise/.


Acknowledgments

This research is supported by EPSRC Programme Grant VisualAI EP/T028572/1 and a Royal Society Research Professorship RP\R1\191132. We thank Robin and Bruno for helpful discussions.

Author information


Corresponding author

Correspondence to Andrew Zisserman.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 282 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Huh, J., Zisserman, A. (2025). Character-Aware Audio-Visual Subtitling in Context. In: Cho, M., Laptev, I., Tran, D., Yao, A., Zha, H. (eds) Computer Vision – ACCV 2024. ACCV 2024. Lecture Notes in Computer Science, vol 15474. Springer, Singapore. https://doi.org/10.1007/978-981-96-0908-6_21


  • DOI: https://doi.org/10.1007/978-981-96-0908-6_21

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-96-0907-9

  • Online ISBN: 978-981-96-0908-6

  • eBook Packages: Computer Science, Computer Science (R0)
