Abstract
Automatically describing video content in natural language has attracted considerable attention in the multimedia community. However, most existing methods train the model with only a word-level cross-entropy loss, ignoring the relationship between the visual content and the semantics of the whole sentence. In addition, during decoding, the resulting models predict one word at a time, feeding the generated word back as the input at the next time step; the other previously generated words are not fully exploited. As a result, the model easily “runs off” if the last generated word is ambiguous. To tackle these issues, we propose a novel framework consisting of a hierarchical long short-term memory network and a text-based sliding window (HLSTM-TSW), which not only optimizes the model at the word level but also strengthens the semantic relationship between the visual content and the entire sentence during training. Moreover, a sliding window focuses on the k previously generated words when predicting the next word, so the model can exploit more of the available context to further improve prediction accuracy. Experiments on the benchmark YouTube2Text dataset demonstrate that our method, using only a single visual feature, achieves results comparable to or even better than state-of-the-art video captioning baselines.
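As a rough illustration of the training objective sketched in the abstract, the snippet below combines the standard word-level cross-entropy loss with a sentence-level term that pulls a pooled video representation toward an embedding of the entire sentence. This is a minimal sketch under our own assumptions: the cosine-based relevance term, the function name `captioning_loss`, the weight `lam`, and all tensor shapes are illustrative, and the paper's exact sentence-level loss may be defined differently.

```python
# Hedged sketch: word-level cross entropy + a sentence-level relevance term.
import torch
import torch.nn.functional as F

def captioning_loss(logits, targets, video_vec, sent_vec, lam=0.2, pad_id=0):
    # logits:    (batch, seq_len, vocab)  per-step next-word scores
    # targets:   (batch, seq_len)         ground-truth word ids
    # video_vec: (batch, dim)             pooled video feature, projected
    # sent_vec:  (batch, dim)             sentence embedding in the same space
    word_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,  # skip padded positions
    )
    # Sentence-level term: encourage high cosine similarity between the video
    # and the whole sentence (1 - cos, so 0 means a perfect match).
    sent_loss = (1.0 - F.cosine_similarity(video_vec, sent_vec, dim=-1)).mean()
    return word_loss + lam * sent_loss
```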
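The text-based sliding window can likewise be sketched as soft attention over the embeddings of the last k generated words, whose weighted sum replaces the single last word as the decoder input. Module names, the attention scoring, and all dimensions below are our assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: one decoding step that attends over the k most recent words.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlidingWindowDecoderStep(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, k=3):
        super().__init__()
        self.k = k
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(embed_dim + hidden_dim, 1)
        self.lstm_cell = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_words, state):
        # prev_words: (batch, t) ids of all words generated so far
        h, c = state
        window = prev_words[:, -self.k:]        # last k generated words
        emb = self.embed(window)                # (batch, <=k, embed_dim)
        # Score each word in the window against the current hidden state.
        h_exp = h.unsqueeze(1).expand(-1, emb.size(1), -1)
        scores = self.attn(torch.cat([emb, h_exp], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)       # attention weights over window
        x = (alpha.unsqueeze(-1) * emb).sum(dim=1)  # weighted input vector
        h, c = self.lstm_cell(x, (h, c))
        return self.out(h), (h, c)              # next-word logits, new state
```

At the first step one would seed `prev_words` with the start token and then iterate this step greedily or under beam search, appending each predicted word to the window.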
Copyright information
© 2019 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
Cite this paper
Xiao, H., Shi, J. (2019). Video Captioning Using Hierarchical LSTM and Text-Based Sliding Window. In: Li, B., Yang, M., Yuan, H., Yan, Z. (eds) IoT as a Service. IoTaaS 2018. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 271. Springer, Cham. https://doi.org/10.1007/978-3-030-14657-3_6