Abstract
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows. Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues. This holistic solution addresses what is said, when it is said, and who is speaking, providing more comprehensive and accurate character-aware subtitling for TV shows. Our approach brings improvements on two fronts: first, we show that audio-visual synchronisation can be used to pick out the talking face amongst others present in a video clip, and to assign an identity to the corresponding speech segment. This audio-visual approach improves recognition accuracy and yield over current methods. Second, we show that the speaker of short segments can be determined by using the temporal context of the dialogue within a scene. We propose an approach using local voice embeddings of the audio and large language model reasoning on the text transcription. This overcomes a limitation of existing methods, which are unable to accurately assign speakers to short temporal segments. We validate the method on a dataset of 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches. Project page: https://www.robots.ox.ac.uk/~vgg/research/llr-context/.
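To make the second idea concrete, below is a minimal illustrative Python sketch (not the authors' implementation) of assigning a speaker to a short segment by comparing its voice embedding against already-labelled segments from the same scene. All names (SceneSegment, assign_short_segment) and the similarity threshold are hypothetical assumptions; the fall-back to large language model reasoning on the dialogue is only indicated in a comment.

```python
# Illustrative sketch only: resolve the speaker of a short, unlabelled segment
# using voice-embedding similarity within the local scene context.
import numpy as np


class SceneSegment:
    """Hypothetical container for one speech segment within a scene."""

    def __init__(self, start, end, embedding, speaker=None):
        self.start = start          # segment start time (seconds)
        self.end = end              # segment end time (seconds)
        self.embedding = embedding  # speaker embedding (e.g. from a speaker-verification model)
        self.speaker = speaker      # character name, or None if not yet assigned


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def assign_short_segment(segment, scene_segments, threshold=0.4):
    """Assign the unlabelled `segment` to the most similar labelled speaker in the
    surrounding scene. Returns None if no neighbour is similar enough, leaving the
    decision to a later stage (e.g. LLM reasoning over the scene's transcript)."""
    best_name, best_sim = None, threshold
    for other in scene_segments:
        if other.speaker is None:
            continue
        sim = cosine(segment.embedding, other.embedding)
        if sim > best_sim:
            best_name, best_sim = other.speaker, sim
    return best_name
```

In this sketch, confidently labelled segments (for example, those matched to a talking face via audio-visual synchronisation) anchor the scene, and short segments inherit the identity of their most similar neighbour rather than being forced through a global clustering step.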
Acknowledgments
This research is supported by EPSRC Programme Grant VisualAI EP/T028572/1 and a Royal Society Research Professorship RP\R1\191132. We thank Robin and Bruno for helpful discussions.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Huh, J., Zisserman, A. (2025). Character-Aware Audio-Visual Subtitling in Context. In: Cho, M., Laptev, I., Tran, D., Yao, A., Zha, H. (eds) Computer Vision – ACCV 2024. ACCV 2024. Lecture Notes in Computer Science, vol 15474. Springer, Singapore. https://doi.org/10.1007/978-981-96-0908-6_21
Print ISBN: 978-981-96-0907-9
Online ISBN: 978-981-96-0908-6