ABSTRACT
Real-world web videos often contain cues beyond the visual stream that can supplement it when generating natural language descriptions. In this paper we propose a sequence-to-sequence model that exploits such auxiliary information. In particular, audio and the topic of the video are used in addition to the visual information in a multimodal framework to generate coherent descriptions of videos "in the wild". In contrast to current encoder-decoder models, which exploit visual information only during the encoding stage, our model fuses the multiple sources of information judiciously, improving over the use of each modality separately. We base our multimodal video description network on the state-of-the-art sequence to sequence video to text (S2VT) model and extend it to take advantage of multiple modalities. Extensive experiments on the challenging MSR-VTT dataset demonstrate the superior performance of the proposed approach on natural videos found on the web.
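To make the multimodal fusion concrete, below is a minimal sketch (not the authors' implementation) of how per-frame visual features, frame-aligned audio features, and a video-level topic vector could be concatenated and fed to an S2VT-style LSTM encoder-decoder. It is written in PyTorch; all module names, feature dimensions, and the simple early fusion by concatenation are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative sketch of a multimodal S2VT-style captioner. Visual frame
# features (e.g., CNN activations), audio features (e.g., MFCCs), and a
# video-level topic vector are fused by concatenation, encoded by an LSTM,
# and used to condition an LSTM decoder trained with teacher forcing.
import torch
import torch.nn as nn

class MultimodalS2VT(nn.Module):
    def __init__(self, vis_dim=2048, aud_dim=39, topic_dim=20,
                 hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        fused_dim = vis_dim + aud_dim + topic_dim
        self.encoder = nn.LSTM(fused_dim, hidden_dim, batch_first=True)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.vocab_proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, vis, aud, topic, captions):
        # vis: (B, T, vis_dim) frame features; aud: (B, T, aud_dim) audio
        # features assumed pre-aligned to the frame rate; topic: (B, topic_dim)
        # video-level vector; captions: (B, L) ground-truth token ids.
        topic_seq = topic.unsqueeze(1).expand(-1, vis.size(1), -1)
        fused = torch.cat([vis, aud, topic_seq], dim=-1)  # early fusion
        _, state = self.encoder(fused)            # summarize the video
        emb = self.word_embed(captions)           # (B, L, embed_dim)
        out, _ = self.decoder(emb, state)         # video-conditioned decoding
        return self.vocab_proj(out)               # (B, L, vocab_size) logits
```

At inference time one would decode token by token (greedy or beam search) starting from a begin-of-sentence symbol; since the paper describes its fusion only as "judicious", the plain concatenation above is a placeholder for whatever fusion scheme the full model actually uses.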
REFERENCES
- M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv preprint arXiv:1603.04467, 2016.
- D. Bahdanau, K. Cho, and Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations, 2015.
- S. Banerjee and A. Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Association for Computational Linguistics Workshop, 2005.
- F. Beritelli and R. Grasso. A Pattern Recognition System for Environmental Sound Classification Based on MFCCs and Neural Networks. In IEEE International Conference on Signal Processing and Communication Systems, pages 1--4, 2008.
- P. Das, C. Xu, R. F. Doell, and J. J. Corso. A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
- A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every Picture Tells a Story: Generating Sentences from Images. In European Conference on Computer Vision, 2010.
- T. Giannakopoulos. pyAudioAnalysis: An Open-Source Python Library for Audio Signal Analysis. PLoS ONE, 10(12):1--17, 2015.
- S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition. In IEEE International Conference on Computer Vision, 2013.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, 29(6):82--97, 2012.
- S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735--1780, 1997.
- A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-Scale Video Classification with Convolutional Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
- N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama. Generating Natural-Language Video Descriptions Using Text-Mined Knowledge. In AAAI Conference on Artificial Intelligence, 2013.
- G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby Talk: Understanding and Generating Simple Image Descriptions. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
- S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing Simple Image Descriptions Using Web-Scale N-grams. In Conference on Computational Natural Language Learning, 2011.
- C.-Y. Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Association for Computational Linguistics Workshop, 2004.
- B. Logan. Mel Frequency Cepstral Coefficients for Music Modeling. In International Symposium on Music Information Retrieval, 2000.
- K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A Method for Automatic Evaluation of Machine Translation. In Association for Computational Linguistics, pages 311--318, 2002.
- J. Pennington, R. Socher, and C. D. Manning. GloVe: Global Vectors for Word Representation. In Conference on Empirical Methods in Natural Language Processing, 2014.
- A. Rohrbach, M. Rohrbach, and B. Schiele. The Long-Short Story of Movie Description. In German Conference on Pattern Recognition, 2015.
- I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems, 2014.
- J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney. Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild. In International Conference on Computational Linguistics, 2014.
- D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. In IEEE International Conference on Computer Vision, 2015.
- L. van der Maaten and G. Hinton. Visualizing Data Using t-SNE. Journal of Machine Learning Research, 9:2579--2605, 2008.
- R. Vedantam, L. C. Zitnick, and D. Parikh. CIDEr: Consensus-Based Image Description Evaluation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
- S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to Sequence - Video to Text. In IEEE International Conference on Computer Vision, 2015.
- J. Xu, T. Mei, T. Yao, and Y. Rui. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image Parsing to Text Description. Proceedings of the IEEE, 98(8):1485--1508, 2010.
Index Terms: Multimodal Video Description