ABSTRACT
Real-world web videos often contain cues beyond the visual stream that can supplement it when generating natural language descriptions. In this paper we propose a sequence-to-sequence model that exploits such auxiliary information. In particular, audio and the topic of the video are used in addition to the visual information in a multimodal framework to generate coherent descriptions of videos "in the wild". In contrast to current encoder-decoder models, which exploit visual information only during the encoding stage, our model fuses the multiple sources of information judiciously, improving over the use of each modality separately. We base our multimodal video description network on the state-of-the-art sequence to sequence video to text (S2VT) model and extend it to take advantage of multiple modalities. Extensive experiments on the challenging MSR-VTT dataset demonstrate the superior performance of the proposed approach on natural videos found on the web.
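To make the multimodal fusion concrete, below is a minimal sketch (not the authors' implementation) of how per-frame visual features, frame-aligned audio features, and a video-level topic vector could be concatenated and fed to an S2VT-style LSTM encoder-decoder. It is written in PyTorch; all module names, feature dimensions, and the simple early fusion by concatenation are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative sketch of a multimodal S2VT-style captioner. Visual frame
# features (e.g., CNN activations), audio features (e.g., MFCCs), and a
# video-level topic vector are fused by concatenation, encoded by an LSTM,
# and used to condition an LSTM decoder trained with teacher forcing.
import torch
import torch.nn as nn

class MultimodalS2VT(nn.Module):
    def __init__(self, vis_dim=2048, aud_dim=39, topic_dim=20,
                 hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        fused_dim = vis_dim + aud_dim + topic_dim
        self.encoder = nn.LSTM(fused_dim, hidden_dim, batch_first=True)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.vocab_proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, vis, aud, topic, captions):
        # vis: (B, T, vis_dim) frame features; aud: (B, T, aud_dim) audio
        # features assumed pre-aligned to the frame rate; topic: (B, topic_dim)
        # video-level vector; captions: (B, L) ground-truth token ids.
        topic_seq = topic.unsqueeze(1).expand(-1, vis.size(1), -1)
        fused = torch.cat([vis, aud, topic_seq], dim=-1)  # early fusion
        _, state = self.encoder(fused)            # summarize the video
        emb = self.word_embed(captions)           # (B, L, embed_dim)
        out, _ = self.decoder(emb, state)         # video-conditioned decoding
        return self.vocab_proj(out)               # (B, L, vocab_size) logits
```

At inference time one would decode token by token (greedy or beam search) starting from a begin-of-sentence symbol; since the paper describes its fusion only as "judicious", the plain concatenation above is a placeholder for whatever fusion scheme the full model actually uses.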
REFERENCES
- M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv preprint arXiv:1603.04467, 2016.
- D. Bahdanau, K. Cho, and Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations, 2015.
- S. Banerjee and A. Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Association for Computational Linguistics Workshop, 2005.
- F. Beritelli and R. Grasso. A Pattern Recognition System for Environmental Sound Classification Based on MFCCs and Neural Networks. In IEEE International Conference on Signal Processing and Communication Systems, pages 1--4, 2008.
- P. Das, C. Xu, R. F. Doell, and J. J. Corso. A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
- A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every Picture Tells a Story: Generating Sentences from Images. In European Conference on Computer Vision, 2010.
- T. Giannakopoulos. pyAudioAnalysis: An Open-Source Python Library for Audio Signal Analysis. PLoS ONE, 10(12):1--17, 2015.
- S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition. In IEEE International Conference on Computer Vision, 2013.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, 29(6):82--97, 2012.
- S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735--1780, 1997.
- A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-Scale Video Classification with Convolutional Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
- N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama. Generating Natural-Language Video Descriptions Using Text-Mined Knowledge. In AAAI Conference on Artificial Intelligence, 2013.
- G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby Talk: Understanding and Generating Simple Image Descriptions. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
- S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing Simple Image Descriptions Using Web-Scale N-grams. In Conference on Computational Natural Language Learning, 2011.
- C.-Y. Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Association for Computational Linguistics Workshop, 2004.
- B. Logan. Mel Frequency Cepstral Coefficients for Music Modeling. In International Symposium on Music Information Retrieval, 2000.
- K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A Method for Automatic Evaluation of Machine Translation. In Association for Computational Linguistics, pages 311--318, 2002.
- J. Pennington, R. Socher, and C. D. Manning. GloVe: Global Vectors for Word Representation. In Conference on Empirical Methods in Natural Language Processing, 2014.
- A. Rohrbach, M. Rohrbach, and B. Schiele. The Long-Short Story of Movie Description. In German Conference on Pattern Recognition, 2015.
- I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems, 2014.
- J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney. Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild. In International Conference on Computational Linguistics, 2014.
- D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. In IEEE International Conference on Computer Vision, 2015.
- L. van der Maaten and G. Hinton. Visualizing Data Using t-SNE. Journal of Machine Learning Research, 9:2579--2605, 2008.
- R. Vedantam, L. C. Zitnick, and D. Parikh. CIDEr: Consensus-Based Image Description Evaluation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
- S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to Sequence - Video to Text. In IEEE International Conference on Computer Vision, 2015.
- J. Xu, T. Mei, T. Yao, and Y. Rui. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image Parsing to Text Description. Proceedings of the IEEE, 98(8):1485--1508, 2010.
Index Terms: Multimodal Video Description