
Description generation of open-domain videos incorporating multimodal features and bidirectional encoder

  • Original Article
  • Published in: The Visual Computer

Abstract

Describing open-domain videos in natural language is a major challenge in video understanding, with applications ranging from assisting blind people to managing massive video collections. This paper presents an improved sequence-to-sequence video-to-text model (MM-BiS2VT) that incorporates multimodal feature fusion and a bidirectional language structure to address the limitations of conventional methods. The model considers four kinds of features: RGB images, optical flow, spatiotemporal features, and audio. RGB image and optical flow features are extracted with ResNet152, spatiotemporal features are obtained with an improved three-dimensional convolutional neural network, and audio features are added to complement the visual information and further improve accuracy. After these features are combined by a feature fusion method, bidirectional long short-term memory units (BiLSTMs) generate the descriptive sentences. The results indicate that fusing multimodal features yields better sentences than competing methods and that the BiLSTMs play a significant role in improving output accuracy, making this work a useful reference for computer vision and video processing.
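To make the pipeline concrete, the following is a minimal sketch (assuming PyTorch) of the idea summarized above: per-modality features (RGB, optical flow, spatiotemporal, audio) are projected to a common dimension, fused, encoded by a bidirectional LSTM, and decoded into a word sequence. The class name, feature dimensions, additive fusion, and mean-pooled context vector are illustrative assumptions, not the authors' exact MM-BiS2VT implementation.

import torch
import torch.nn as nn

class MultimodalBiLSTMCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dims=(2048, 2048, 4096, 128), hidden=512):
        super().__init__()
        # One linear projection per modality: RGB (ResNet152), optical flow,
        # spatiotemporal (3-D CNN) and audio features; dimensions are assumptions.
        self.projections = nn.ModuleList([nn.Linear(d, hidden) for d in feat_dims])
        # Bidirectional LSTM encodes the fused per-frame features in both directions.
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        # Unidirectional LSTM decoder generates the description word by word,
        # conditioned on a pooled video context vector.
        self.decoder = nn.LSTM(hidden + 2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, features, captions):
        # features: list of 4 tensors, each (batch, time, feat_dim); captions: (batch, length)
        fused = sum(proj(f) for proj, f in zip(self.projections, features))  # additive fusion
        enc_out, _ = self.encoder(fused)                # (batch, time, 2 * hidden)
        context = enc_out.mean(dim=1, keepdim=True)     # mean-pooled video context
        emb = self.embed(captions)                      # (batch, length, hidden)
        ctx = context.expand(-1, emb.size(1), -1)
        dec_out, _ = self.decoder(torch.cat([emb, ctx], dim=-1))
        return self.out(dec_out)                        # word logits at every step

# Toy usage with random features: two videos, 20 time steps, 12-word captions.
model = MultimodalBiLSTMCaptioner(vocab_size=10000)
feats = [torch.randn(2, 20, d) for d in (2048, 2048, 4096, 128)]
caps = torch.randint(0, 10000, (2, 12))
logits = model(feats, caps)  # shape (2, 12, 10000)

In training, such a model would typically be optimized with cross-entropy loss against reference captions; the paper's actual model replaces the simple additive fusion and pooled context sketched here with its own fusion scheme and BiLSTM-based sequence-to-sequence structure.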



Acknowledgements

This work was supported by the Research and Industrialization for Intelligent Video Processing Technology Based on GPU Parallel Computing project of the Science and Technology Support Program of Jiangsu Province (BY 2016003-11) and the Application Platform and Industrialization for Efficient Cloud Computing for Big Data project of the Science and Technology Support Program of Jiangsu Province (BA2015052). We also thank the many researchers who generously shared the prior work on which this study builds.

Author information

Corresponding author

Correspondence to Xiaotong Du.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.


About this article


Cite this article

Du, X., Yuan, J., Hu, L. et al. Description generation of open-domain videos incorporating multimodal features and bidirectional encoder. Vis Comput 35, 1703–1712 (2019). https://doi.org/10.1007/s00371-018-1591-x

