
Video Captioning Using Hierarchical LSTM and Text-Based Sliding Window

  • Conference paper

Abstract

Automatically describing video content with natural language has been attracting considerable attention in the multimedia community. However, most existing methods train the model using only a word-level cross-entropy loss, ignoring the relationship between the visual content and sentence-level semantics. In addition, during decoding, the resulting models predict one word at a time, feeding each generated word back as the input at the next time step; the other previously generated words are not fully exploited. As a result, the model easily "runs off" when the last generated word is ambiguous. To tackle these issues, we propose a novel framework consisting of a hierarchical long short-term memory network and a text-based sliding window (HLSTM-TSW), which not only optimizes the model at the word level, but also strengthens the semantic relationship between the visual content and the entire sentence during training. Moreover, a sliding window focuses on the k previously generated words when predicting the next word, so that the model can draw on more useful information to further improve prediction accuracy. Experiments on the benchmark YouTube2Text dataset demonstrate that our method, using only a single feature, achieves results comparable to or even better than state-of-the-art baselines for video captioning.
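The text-based sliding window described in the abstract can be illustrated with a small decoding sketch. The snippet below is not the authors' implementation; it is a minimal PyTorch sketch, assuming a single mean-pooled video feature, greedy decoding, and a simple learned attention over the embeddings of the last k generated words. All layer sizes, the attention form, and names such as SlidingWindowDecoder and bos_id are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SlidingWindowDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, k=3):
        super().__init__()
        self.k = k
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Each step consumes [attended window context ; video feature].
        self.lstm = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim + embed_dim, 1)   # scores each window word
        self.out = nn.Linear(hidden_dim, vocab_size)

    def window_context(self, h, window_ids):
        # Attend over the (at most) k most recently generated words.
        emb = self.embed(window_ids)                                      # (B, w, E)
        scores = self.attn(torch.cat(
            [h.unsqueeze(1).expand(-1, emb.size(1), -1), emb], dim=-1))   # (B, w, 1)
        alpha = F.softmax(scores, dim=1)
        return (alpha * emb).sum(dim=1)                                   # (B, E)

    def forward(self, video_feat, max_len=20, bos_id=1):
        # video_feat: (B, hidden_dim), e.g. a mean-pooled CNN feature of the clip.
        B = video_feat.size(0)
        h = video_feat.new_zeros(B, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        window = [torch.full((B,), bos_id, dtype=torch.long,
                             device=video_feat.device)]
        logits_per_step = []
        for _ in range(max_len):
            window_ids = torch.stack(window[-self.k:], dim=1)             # (B, <=k)
            ctx = self.window_context(h, window_ids)
            h, c = self.lstm(torch.cat([ctx, video_feat], dim=-1), (h, c))
            logits = self.out(h)
            logits_per_step.append(logits)
            window.append(logits.argmax(dim=-1))                          # greedy choice
        return torch.stack(logits_per_step, dim=1)                        # (B, T, vocab)

# Example: decoder = SlidingWindowDecoder(vocab_size=10000)
#          caption_logits = decoder(torch.randn(4, 512))

At training time the window would hold the ground-truth word history rather than greedy predictions, and the hierarchical, sentence-level objective mentioned in the abstract is omitted from this sketch.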



Author information


Corresponding author

Correspondence to Jinglun Shi.


Copyright information

© 2019 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper


Cite this paper

Xiao, H., Shi, J. (2019). Video Captioning Using Hierarchical LSTM and Text-Based Sliding Window. In: Li, B., Yang, M., Yuan, H., Yan, Z. (eds) IoT as a Service. IoTaaS 2018. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 271. Springer, Cham. https://doi.org/10.1007/978-3-030-14657-3_6


  • DOI: https://doi.org/10.1007/978-3-030-14657-3_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-14656-6

  • Online ISBN: 978-3-030-14657-3

  • eBook Packages: Computer Science (R0)
