Abstract
Automatically describing video content in natural language has attracted considerable attention in the multimedia community. However, most existing methods train the model with only a word-level cross-entropy loss, ignoring the relationship between the visual content and the semantics of the whole sentence. In addition, during decoding, the resulting models predict one word at a time, feeding the generated word back as the input at the next time step; the other previously generated words are not fully exploited. As a result, the model easily “runs off” if the last generated word is ambiguous. To tackle these issues, we propose a novel framework consisting of a hierarchical long short-term memory network and a text-based sliding window (HLSTM-TSW), which not only optimizes the model at the word level but also strengthens the semantic relationship between the visual content and the entire sentence during training. Moreover, a sliding window focuses on the k previously generated words when predicting the next word, so the model can exploit more of the available context to further improve prediction accuracy. Experiments on the benchmark YouTube2Text dataset demonstrate that our method, using only a single visual feature, achieves results comparable to or even better than state-of-the-art video captioning baselines.
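As a rough illustration of the training objective sketched in the abstract, the snippet below combines the standard word-level cross-entropy loss with a sentence-level term that pulls a pooled video representation toward an embedding of the entire sentence. This is a minimal sketch under our own assumptions: the cosine-based relevance term, the function name `captioning_loss`, the weight `lam`, and all tensor shapes are illustrative, and the paper's exact sentence-level loss may be defined differently.

```python
# Hedged sketch: word-level cross entropy + a sentence-level relevance term.
import torch
import torch.nn.functional as F

def captioning_loss(logits, targets, video_vec, sent_vec, lam=0.2, pad_id=0):
    # logits:    (batch, seq_len, vocab)  per-step next-word scores
    # targets:   (batch, seq_len)         ground-truth word ids
    # video_vec: (batch, dim)             pooled video feature, projected
    # sent_vec:  (batch, dim)             sentence embedding in the same space
    word_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,  # skip padded positions
    )
    # Sentence-level term: encourage high cosine similarity between the video
    # and the whole sentence (1 - cos, so 0 means a perfect match).
    sent_loss = (1.0 - F.cosine_similarity(video_vec, sent_vec, dim=-1)).mean()
    return word_loss + lam * sent_loss
```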
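The text-based sliding window can likewise be sketched as soft attention over the embeddings of the last k generated words, whose weighted sum replaces the single last word as the decoder input. Module names, the attention scoring, and all dimensions below are our assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: one decoding step that attends over the k most recent words.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlidingWindowDecoderStep(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, k=3):
        super().__init__()
        self.k = k
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(embed_dim + hidden_dim, 1)
        self.lstm_cell = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_words, state):
        # prev_words: (batch, t) ids of all words generated so far
        h, c = state
        window = prev_words[:, -self.k:]        # last k generated words
        emb = self.embed(window)                # (batch, <=k, embed_dim)
        # Score each word in the window against the current hidden state.
        h_exp = h.unsqueeze(1).expand(-1, emb.size(1), -1)
        scores = self.attn(torch.cat([emb, h_exp], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)       # attention weights over window
        x = (alpha.unsqueeze(-1) * emb).sum(dim=1)  # weighted input vector
        h, c = self.lstm_cell(x, (h, c))
        return self.out(h), (h, c)              # next-word logits, new state
```

At the first step one would seed `prev_words` with the start token and then iterate this step greedily or under beam search, appending each predicted word to the window.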
Copyright information
© 2019 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
Cite this paper
Xiao, H., Shi, J. (2019). Video Captioning Using Hierarchical LSTM and Text-Based Sliding Window. In: Li, B., Yang, M., Yuan, H., Yan, Z. (eds) IoT as a Service. IoTaaS 2018. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 271. Springer, Cham. https://doi.org/10.1007/978-3-030-14657-3_6