Abstract
We have witnessed promising advances in video captioning in recent years. The task remains challenging because it is hard to capture the semantic correspondences between visual content and language descriptions. Different granularities of language components (e.g., words, phrases, and sentences) correspond to different granularities of visual elements (e.g., objects, visual relations, and regions of interest). These correspondences provide multi-level alignments and complementary information for transforming visual content into language descriptions. We therefore propose an Attention Guided Hierarchical Alignment (AGHA) approach for video captioning. In the proposed approach, hierarchical vision-language alignments, including object-word, relation-phrase, and region-sentence alignments, are extracted from a well-learned model suited to multiple vision-and-language tasks, and are then embedded into parallel encoder-decoder streams to provide multi-level semantic guidance and rich complementary information for description generation. In addition, multi-granularity visual features are exploited to obtain a coarse-to-fine understanding of complex video content, where an attention mechanism is applied to extract comprehensive visual discrimination and enhance video captioning. Experimental results on the widely used MSVD dataset demonstrate that AGHA achieves promising improvements on popular evaluation metrics.
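Each encoder-decoder stream attends over one granularity of visual features while generating the description. Below is a minimal sketch of one such attention-guided decoding step in PyTorch; the module names, dimensions, and the Bahdanau-style additive attention are illustrative assumptions for a single stream, not the authors' AGHA implementation.

# Minimal sketch (PyTorch) of one attention-guided decoder step over a single
# granularity of visual features (e.g., frame- or region-level features).
# All names and dimensions are illustrative assumptions, not the AGHA code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Bahdanau-style additive attention over per-frame visual features."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, num_frames, feat_dim), hidden: (batch, hidden_dim)
        energy = torch.tanh(self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1))
        alpha = F.softmax(self.score(energy).squeeze(-1), dim=1)   # (batch, num_frames)
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)         # (batch, feat_dim)
        return context, alpha

class CaptionDecoderStep(nn.Module):
    """One LSTM decoding step conditioned on an attended visual context."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attend = TemporalAttention(feat_dim, hidden_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, feats, state):
        h, c = state
        context, alpha = self.attend(feats, h)
        h, c = self.lstm(torch.cat([self.embed(prev_word), context], dim=1), (h, c))
        return self.out(h), (h, c), alpha

# Toy usage: 2 videos, 20 frames, 2048-d features, vocabulary of 1000 words.
if __name__ == "__main__":
    decoder = CaptionDecoderStep(vocab_size=1000)
    feats = torch.randn(2, 20, 2048)
    state = (torch.zeros(2, 512), torch.zeros(2, 512))
    logits, state, alpha = decoder(torch.tensor([1, 1]), feats, state)
    print(logits.shape, alpha.shape)  # torch.Size([2, 1000]) torch.Size([2, 20])

In the full approach, one such stream would be run per alignment level (object-word, relation-phrase, region-sentence) in parallel, with their guidance combined for the final description; the sketch shows only a single stream.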
Notes
1. We denote a visual relation as a \(<object1-predicate-object2>\) triplet, e.g. \(<person-riding-horse>\).
Acknowledgment
This work was supported by the National Natural Science Foundation of China under Grant 61771025.
Cite this paper
Zhang, J., Peng, Y. (2019). Hierarchical Vision-Language Alignment for Video Captioning. In: Kompatsiaris, I., Huet, B., Mezaris, V., Gurrin, C., Cheng, W.H., Vrochidis, S. (eds) MultiMedia Modeling. MMM 2019. Lecture Notes in Computer Science, vol 11295. Springer, Cham. https://doi.org/10.1007/978-3-030-05710-7_4