
Description generation of open-domain videos incorporating multimodal features and bidirectional encoder

  • Original Article
  • Published in: The Visual Computer

Abstract

Describing open-domain videos in natural language is a major challenge in video understanding, with applications ranging from assisting blind people to managing massive video collections. This paper presents an improved sequence-to-sequence video-to-text model (MM-BiS2VT) that incorporates multimodal feature fusion and a bidirectional language structure to address the limitations of conventional methods. The model considers four kinds of features: RGB images, optical flow, spatiotemporal features, and audio. RGB image and optical flow features are extracted with ResNet152, spatiotemporal features are obtained with an improved three-dimensional convolutional neural network, and audio features are added to complement the visual information and further improve accuracy. After these features are combined by a feature fusion method, bidirectional long short-term memory units (BiLSTMs) generate the descriptive sentences. The results indicate that fusing multimodal features yields better sentences than competing methods and that the BiLSTMs play a significant role in improving output accuracy, making this work a useful reference for computer vision and video processing.
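To make the pipeline concrete, the following is a minimal sketch (assuming PyTorch) of the idea summarized above: per-modality features (RGB, optical flow, spatiotemporal, audio) are projected to a common dimension, fused, encoded by a bidirectional LSTM, and decoded into a word sequence. The class name, feature dimensions, additive fusion, and mean-pooled context vector are illustrative assumptions, not the authors' exact MM-BiS2VT implementation.

import torch
import torch.nn as nn

class MultimodalBiLSTMCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dims=(2048, 2048, 4096, 128), hidden=512):
        super().__init__()
        # One linear projection per modality: RGB (ResNet152), optical flow,
        # spatiotemporal (3-D CNN) and audio features; dimensions are assumptions.
        self.projections = nn.ModuleList([nn.Linear(d, hidden) for d in feat_dims])
        # Bidirectional LSTM encodes the fused per-frame features in both directions.
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        # Unidirectional LSTM decoder generates the description word by word,
        # conditioned on a pooled video context vector.
        self.decoder = nn.LSTM(hidden + 2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, features, captions):
        # features: list of 4 tensors, each (batch, time, feat_dim); captions: (batch, length)
        fused = sum(proj(f) for proj, f in zip(self.projections, features))  # additive fusion
        enc_out, _ = self.encoder(fused)                # (batch, time, 2 * hidden)
        context = enc_out.mean(dim=1, keepdim=True)     # mean-pooled video context
        emb = self.embed(captions)                      # (batch, length, hidden)
        ctx = context.expand(-1, emb.size(1), -1)
        dec_out, _ = self.decoder(torch.cat([emb, ctx], dim=-1))
        return self.out(dec_out)                        # word logits at every step

# Toy usage with random features: two videos, 20 time steps, 12-word captions.
model = MultimodalBiLSTMCaptioner(vocab_size=10000)
feats = [torch.randn(2, 20, d) for d in (2048, 2048, 4096, 128)]
caps = torch.randint(0, 10000, (2, 12))
logits = model(feats, caps)  # shape (2, 12, 10000)

In training, such a model would typically be optimized with cross-entropy loss against reference captions; the paper's actual model replaces the simple additive fusion and pooled context sketched here with its own fusion scheme and BiLSTM-based sequence-to-sequence structure.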



Acknowledgements

This work was supported by the Research and Industrialization for Intelligent Video Processing Technology Based on GPU Parallel Computing project of the Science and Technology Support Program of Jiangsu Province (BY 2016003-11) and the Application Platform and Industrialization for Efficient Cloud Computing for Big Data project of the Science and Technology Support Program of Jiangsu Province (BA2015052). We also thank the many researchers who generously shared the prior work on which this study builds.

Author information

Corresponding author

Correspondence to Xiaotong Du.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.


About this article


Cite this article

Du, X., Yuan, J., Hu, L. et al. Description generation of open-domain videos incorporating multimodal features and bidirectional encoder. Vis Comput 35, 1703–1712 (2019). https://doi.org/10.1007/s00371-018-1591-x

