
It’s Just Another Day: Unique Video Captioning by Discriminative Prompting

  • Conference paper
  • Computer Vision – ACCV 2024 (ACCV 2024)

Abstract

Long videos contain many repeating actions, events and shots. These repetitions are frequently given identical captions, which makes it difficult to retrieve the exact desired clip using a text search. In this paper, we formulate the problem of unique captioning: Given multiple clips with the same caption, we generate a new caption for each clip that uniquely identifies it. We propose Captioning by Discriminative Prompting (CDP), which predicts a property that can separate identically captioned clips, and use this property to generate unique captions. We introduce two benchmarks for unique captioning, based on egocentric footage and timeloop movies – where repeating actions are common. We demonstrate that captions generated by CDP improve text-to-video R@1 by 15% for egocentric videos and by 10% for timeloop movies.

https://tobyperrett.github.io/its-just-another-day
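
The retrieval gains above are reported as text-to-video R@1. As an illustration only (not the authors' code), the sketch below shows one way to compute R@1 within a group of clips that originally shared a single caption, given the newly generated per-clip captions and a joint text-video embedding model; `embed_text` and `embed_video` are hypothetical placeholders for such a model.

```python
# Hedged sketch: text-to-video R@1 within a group of identically captioned clips.
# Assumes caption_embs[i] embeds the new caption generated for clip_embs[i].
import numpy as np

def recall_at_1(caption_embs: np.ndarray, clip_embs: np.ndarray) -> float:
    """Fraction of captions whose nearest clip (cosine similarity) is their own clip."""
    # L2-normalise so dot products equal cosine similarities.
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    v = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    sims = c @ v.T                       # (num_captions, num_clips)
    top1 = sims.argmax(axis=1)           # best-matching clip for each caption
    return float((top1 == np.arange(len(c))).mean())

# Usage (embed_text / embed_video are hypothetical encoder calls):
# caption_embs = np.stack([embed_text(t) for t in captions])
# clip_embs = np.stack([embed_video(v) for v in clips])
# print(recall_at_1(caption_embs, clip_embs))
```

Rank-1 retrieval within the group directly measures whether each new caption picks out its own clip from its identically captioned neighbours.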



Acknowledgements

Research is supported by EPSRC Programme Grant Visual AI (EP/T028572/1) and EPSRC UMPIRE (EP/T004991/1). This project acknowledges the use of the EPSRC funded Tier 2 facility, JADE-II.

Author information


Corresponding author

Correspondence to Toby Perrett.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 4759 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Perrett, T., Han, T., Damen, D., Zisserman, A. (2025). It’s Just Another Day: Unique Video Captioning by Discriminative Prompting. In: Cho, M., Laptev, I., Tran, D., Yao, A., Zha, H. (eds) Computer Vision – ACCV 2024. ACCV 2024. Lecture Notes in Computer Science, vol 15474. Springer, Singapore. https://doi.org/10.1007/978-981-96-0908-6_16


  • DOI: https://doi.org/10.1007/978-981-96-0908-6_16

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-96-0907-9

  • Online ISBN: 978-981-96-0908-6

  • eBook Packages: Computer Science, Computer Science (R0)
