Abstract
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows. Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues. This holistic solution addresses what is said, when it is said, and who is speaking, providing more comprehensive and accurate character-aware subtitling for TV shows. Our approach brings improvements on two fronts: first, we show that audio-visual synchronisation can be used to pick out the talking face amongst others present in a video clip, and to assign an identity to the corresponding speech segment. This audio-visual approach improves recognition accuracy and yield over current methods. Second, we show that the speaker of short segments can be determined by using the temporal context of the dialogue within a scene. We propose an approach using local voice embeddings of the audio and large language model reasoning on the text transcription. This overcomes a limitation of existing methods, which are unable to accurately assign speakers to short temporal segments. We validate the method on a dataset of 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches. Project page: https://www.robots.ox.ac.uk/~vgg/research/llr-context/.
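To make the second idea concrete, below is a minimal illustrative Python sketch (not the authors' implementation) of assigning a speaker to a short segment by comparing its voice embedding against already-labelled segments from the same scene. All names (SceneSegment, assign_short_segment) and the similarity threshold are hypothetical assumptions; the fall-back to large language model reasoning on the dialogue is only indicated in a comment.

```python
# Illustrative sketch only: resolve the speaker of a short, unlabelled segment
# using voice-embedding similarity within the local scene context.
import numpy as np


class SceneSegment:
    """Hypothetical container for one speech segment within a scene."""

    def __init__(self, start, end, embedding, speaker=None):
        self.start = start          # segment start time (seconds)
        self.end = end              # segment end time (seconds)
        self.embedding = embedding  # speaker embedding (e.g. from a speaker-verification model)
        self.speaker = speaker      # character name, or None if not yet assigned


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def assign_short_segment(segment, scene_segments, threshold=0.4):
    """Assign the unlabelled `segment` to the most similar labelled speaker in the
    surrounding scene. Returns None if no neighbour is similar enough, leaving the
    decision to a later stage (e.g. LLM reasoning over the scene's transcript)."""
    best_name, best_sim = None, threshold
    for other in scene_segments:
        if other.speaker is None:
            continue
        sim = cosine(segment.embedding, other.embedding)
        if sim > best_sim:
            best_name, best_sim = other.speaker, sim
    return best_name
```

In this sketch, confidently labelled segments (for example, those matched to a talking face via audio-visual synchronisation) anchor the scene, and short segments inherit the identity of their most similar neighbour rather than being forced through a global clustering step.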
Acknowledgments
This research is supported by EPSRC Programme Grant VisualAI EP/T028572/1 and a Royal Society Research Professorship RP\R1\191132. We thank Robin and Bruno for helpful discussions.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Huh, J., Zisserman, A. (2025). Character-Aware Audio-Visual Subtitling in Context. In: Cho, M., Laptev, I., Tran, D., Yao, A., Zha, H. (eds) Computer Vision – ACCV 2024. ACCV 2024. Lecture Notes in Computer Science, vol 15474. Springer, Singapore. https://doi.org/10.1007/978-981-96-0908-6_21
Print ISBN: 978-981-96-0907-9
Online ISBN: 978-981-96-0908-6