Abstract
Automatic video captioning is challenging due to the complex interactions in dynamic real-world scenes. A comprehensive system would ultimately localize and track the objects, actions, and interactions present in a video, and generate a description that relies on temporal localization in order to ground the visual concepts. However, most existing automatic video captioning systems map directly from raw video data to a high-level textual description, bypassing localization and recognition, and thus discarding information that is potentially valuable for content localization and generalization. In this work we present an automatic video captioning model that combines spatio-temporal attention and image classification by means of deep neural network structures based on long short-term memory (LSTM). The resulting system is demonstrated to produce state-of-the-art results on the standard YouTube captioning benchmark, while also offering the advantage of localizing the visual concepts (subjects, verbs, objects) over space and time, with no grounding supervision.
M. Zanfir and E. Marinoiu—Authors contributed equally.
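The sketch below is a minimal illustration of the idea summarized in the abstract: an LSTM language model that, at each generated word, softly attends over region-level CNN features across all frames, so the attention weights double as a spatio-temporal localization of the emitted concept. The class, layer names, feature dimensions, and the additive scoring function are illustrative assumptions (written in PyTorch), not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalAttentionDecoder(nn.Module):
    """Hypothetical LSTM decoder with soft spatio-temporal attention over
    per-frame region features; names and dimensions are illustrative only."""

    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive attention scored jointly over every (frame, region) feature.
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_word, feats, h, c):
        # prev_word: (B,) previous token ids
        # feats:     (B, T, R, feat_dim) CNN features for R regions in each of T frames
        B, T, R, D = feats.shape
        flat = feats.view(B, T * R, D)                     # pool frames and regions together
        energy = torch.tanh(self.att_feat(flat) + self.att_hidden(h).unsqueeze(1))
        alpha = torch.softmax(self.att_score(energy).squeeze(-1), dim=1)
        context = (alpha.unsqueeze(-1) * flat).sum(dim=1)  # attended visual context
        h, c = self.lstm(torch.cat([self.embed(prev_word), context], dim=-1), (h, c))
        logits = self.out(h)                               # scores over the next word
        # Reshaping alpha to (B, T, R) exposes, per generated word, where the model
        # looked in space and time, i.e. the grounding signal discussed above.
        return logits, alpha.view(B, T, R), h, c
```

Calling step repeatedly with the previously sampled word, while carrying (h, c) forward, would produce a caption together with a per-word spatio-temporal attention map, which is the sense in which the visual concepts are localized without grounding supervision.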
Acknowledgement
This work was supported in part by CNCS-UEFISCDI under PCE-2011-3-0438, JRP-RO-FR-2014-16 and NVIDIA through a GPU donation.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Zanfir, M., Marinoiu, E., Sminchisescu, C. (2017). Spatio-Temporal Attention Models for Grounded Video Captioning. In: Lai, S.H., Lepetit, V., Nishino, K., Sato, Y. (eds) Computer Vision – ACCV 2016. Lecture Notes in Computer Science, vol 10114. Springer, Cham. https://doi.org/10.1007/978-3-319-54190-7_7
DOI: https://doi.org/10.1007/978-3-319-54190-7_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54189-1
Online ISBN: 978-3-319-54190-7
eBook Packages: Computer Science, Computer Science (R0)