Spatial-Temporal Contextual Feature Fusion Network for Movie Description

Liao, Yihui; Fan, Lu; Ding, Huiming; Xie, Zhifeng

doi:10.1007/978-3-031-20497-5_40

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13604)

Included in the following conference series:

CAAI International Conference on Artificial Intelligence

Abstract

The movie description task aims to generate narrative textual descriptions that match the content of a movie. Most current methods cannot combine comprehensive visual content analysis with the use of contextual information, which leads to inaccurate or incoherent descriptions. To tackle this problem, we propose a new method called the spatial-temporal contextual feature fusion network (ST-CFFNet), which captures both spatial-temporal and contextual information in movies by building a stacked visual graph attention encoding unit and a contextual feature fusion module. We also propose a spatial-temporal context loss to constrain ST-CFFNet toward effective spatial-temporal relation analysis and context modeling. Experimental results on the LSMDC dataset show that our method produces more accurate and coherent movie descriptions.
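
The abstract names a graph attention encoding unit but does not specify it. For orientation only, the following is a minimal sketch of a generic single-head graph attention layer in the style of Veličković et al. (2017), not the authors' ST-CFFNet implementation. The function names, the always-on self-loop, and the toy inputs are illustrative assumptions.

```python
import math


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def graph_attention(features, adjacency, W, a):
    """Generic single-head graph attention layer (hypothetical helper).

    features:  list of N node vectors, e.g. per-frame or per-region features
    adjacency: N x N matrix; a truthy adjacency[i][j] makes j a neighbour of i
    W:         projection matrix of shape (out_dim, in_dim)
    a:         attention vector of length 2 * out_dim
    Returns the updated node vectors and the attention weights per node.
    """
    projected = [matvec(W, h) for h in features]
    n = len(features)
    outputs, weights = [], []
    for i in range(n):
        # Assumption: every node attends to itself, so the set is never empty.
        neighbours = [j for j in range(n) if adjacency[i][j] or j == i]
        # Unnormalised score: LeakyReLU(a . [W h_i || W h_j]).
        scores = [
            leaky_relu(sum(ak * v for ak, v in zip(a, projected[i] + projected[j])))
            for j in neighbours
        ]
        # Numerically stable softmax over the neighbourhood.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Attention-weighted sum of the projected neighbour features.
        outputs.append([
            sum(alpha * projected[j][k] for alpha, j in zip(alphas, neighbours))
            for k in range(len(projected[i]))
        ])
        weights.append(dict(zip(neighbours, alphas)))
    return outputs, weights
```

Stacking such layers, each consuming the previous layer's outputs, is one common way to build a deeper graph encoder; whether ST-CFFNet stacks its units this way is not stated in the abstract.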

Acknowledgments

This work was supported by the Shanghai Natural Science Foundation of China under Grant No. 19ZR1419100.

Author information

Authors and Affiliations

Shanghai University, Shanghai, China
Yihui Liao, Lu Fan, Huiming Ding & Zhifeng Xie

Corresponding author

Correspondence to Zhifeng Xie.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Lu Fang
Xiaomi Inc., Beijing, China
Daniel Povey
Shanghai Jiao Tong University, Shanghai, China
Guangtao Zhai
JD Explore Academy, Beijing, China
Tao Mei
Chinese Academy of Sciences, Beijing, China
Ruiping Wang

About this paper

Cite this paper

Liao, Y., Fan, L., Ding, H., Xie, Z. (2022). Spatial-Temporal Contextual Feature Fusion Network for Movie Description. In: Fang, L., Povey, D., Zhai, G., Mei, T., Wang, R. (eds) Artificial Intelligence. CICAI 2022. Lecture Notes in Computer Science, vol 13604. Springer, Cham. https://doi.org/10.1007/978-3-031-20497-5_40

DOI: https://doi.org/10.1007/978-3-031-20497-5_40
Published: 17 December 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20496-8
Online ISBN: 978-3-031-20497-5
eBook Packages: Computer Science, Computer Science (R0)
