Boundary Detector Encoder and Decoder with Soft Attention for Video Captioning

Chen, Tangming; Zhao, Qike; Song, Jingkuan

doi:10.1007/978-3-030-33982-1_9

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11809)

Included in the following conference series:

Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data

Abstract

With the rapid development of deep learning, the use of Recurrent Neural Networks and Convolutional Neural Networks for video captioning has received widespread attention. Building on the classical encoder-decoder approach, we modify both the encoding and the decoding networks to improve the performance of the overall model. In this paper, we introduce an encoding scheme that can detect the hierarchical structure of the input video. In addition, we use a soft attention mechanism that learns to automatically select the relevant frames of the input video when generating its description. Extensive experiments are conducted on two datasets: the Microsoft Video Description Corpus and MSR-Video to Text. Three metrics, BLEU@4, METEOR and CIDEr, are used to evaluate our approach. Experimental results demonstrate the effectiveness of our approach.
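
The abstract does not give the exact form of the soft attention mechanism, so the following is only a minimal sketch of how soft attention over frame features is commonly computed. It assumes additive (tanh-based) scoring, and every name in it (`soft_attention`, `W_h`, `W_v`, `w`) is hypothetical rather than taken from the paper. Each frame receives a score conditioned on the current decoder state, the scores are normalized with a softmax, and the context vector is the weighted sum of the frame features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def soft_attention(frame_feats, hidden, W_h, W_v, w):
    """Return (weights, context) for one decoding step.

    Assumed scoring (not confirmed by the source):
        e_i = w . tanh(W_h h + W_v v_i)
    weights = softmax(e); context = sum_i weights_i * v_i
    """
    h_proj = matvec(W_h, hidden)
    scores = []
    for v in frame_feats:
        v_proj = matvec(W_v, v)
        act = [math.tanh(a + b) for a, b in zip(h_proj, v_proj)]
        scores.append(sum(wi * ai for wi, ai in zip(w, act)))
    weights = softmax(scores)
    dim = len(frame_feats[0])
    context = [
        sum(wt * v[d] for wt, v in zip(weights, frame_feats))
        for d in range(dim)
    ]
    return weights, context


# Toy example: three frames with 2-dimensional features (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = soft_attention(frames, hidden, identity, identity, [1.0, 1.0])
```

Because the weights are non-negative and sum to one, the context vector is a convex combination of the frame features, which is what lets the decoder focus "softly" on some frames without discarding the others.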

Acknowledgments

This work is supported by the Major Scientific and Technological Special Project of Guizhou Province (20183002).

Author information

Authors and Affiliations

Guizhou Provincial Key Laboratory of Public Big Data, Guizhou University, Guiyang, 550025, Guizhou, China
Tangming Chen, Qike Zhao & Jingkuan Song
University of Electronic Science and Technology of China, Chengdu, China
Tangming Chen, Qike Zhao & Jingkuan Song

Corresponding author

Correspondence to Tangming Chen.

Editor information

Editors and Affiliations

University of Electronic Science and Technology of China, Chengdu, China
Jingkuan Song
Massey University, Auckland, New Zealand
Xiaofeng Zhu

About this paper

Cite this paper

Chen, T., Zhao, Q., Song, J. (2019). Boundary Detector Encoder and Decoder with Soft Attention for Video Captioning. In: Song, J., Zhu, X. (eds) Web and Big Data. APWeb-WAIM 2019. Lecture Notes in Computer Science, vol 11809. Springer, Cham. https://doi.org/10.1007/978-3-030-33982-1_9

DOI: https://doi.org/10.1007/978-3-030-33982-1_9
Published: 01 November 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33981-4
Online ISBN: 978-3-030-33982-1
eBook Packages: Computer Science, Computer Science (R0)
