DOI: 10.1145/2964284.2984066
Research Article · Public Access

Multimodal Video Description

Published: 01 October 2016

ABSTRACT

Real-world web videos often contain cues beyond the visual stream that can inform natural language description. In this paper we propose a sequence-to-sequence model that exploits such auxiliary information. In particular, audio and the topic of the video are used alongside the visual information in a multimodal framework to generate coherent descriptions of videos "in the wild". In contrast to current encoder-decoder models, which exploit visual information only during the encoding stage, our model fuses the multiple sources of information judiciously and improves over using each modality separately. We base our multimodal video description network on the state-of-the-art sequence-to-sequence video-to-text (S2VT) model and extend it to take advantage of multiple modalities. Extensive experiments on the challenging MSR-VTT dataset demonstrate the superior performance of the proposed approach on natural videos found on the web.
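
The page carries no code, so the following is only a rough illustrative sketch of the fusion idea the abstract describes: an S2VT-style LSTM encoder-decoder that projects per-frame visual features, per-frame audio features, and a video topic embedding into a shared space before encoding, so that non-visual cues remain available during decoding. Everything here is an assumption for illustration, not the authors' implementation: the dimensions (vis_dim, aud_dim, n_topics), the fusion-by-summation choice, and the use of PyTorch rather than the TensorFlow toolkit the paper cites.

import torch
import torch.nn as nn

class MultimodalS2VT(nn.Module):
    """Hypothetical S2VT-style model fusing visual, audio, and topic cues.

    Assumed inputs: per-frame CNN features (e.g. a 2048-d pooled vector),
    per-frame audio features (e.g. MFCC statistics), and an integer video
    topic/category id. None of these sizes come from the paper.
    """

    def __init__(self, vis_dim=2048, aud_dim=39, n_topics=20,
                 hidden=512, vocab_size=10000):
        super().__init__()
        # Project each modality into a shared hidden space, then fuse by sum.
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.aud_proj = nn.Linear(aud_dim, hidden)
        self.topic_emb = nn.Embedding(n_topics, hidden)
        self.word_emb = nn.Embedding(vocab_size, hidden)
        # First LSTM reads the fused frame sequence; second LSTM decodes words.
        self.enc_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.dec_lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, vis_feats, aud_feats, topic_ids, captions):
        # vis_feats: (B, T, vis_dim); aud_feats: (B, T, aud_dim)
        # topic_ids: (B,) long; captions: (B, L) word indices
        topic = self.topic_emb(topic_ids).unsqueeze(1)            # (B, 1, H)
        fused = self.vis_proj(vis_feats) + self.aud_proj(aud_feats) + topic
        enc_out, state = self.enc_lstm(fused)                     # encode frames
        # Pair each word embedding with the final encoder output so the
        # fused multimodal context stays visible throughout decoding.
        ctx = enc_out[:, -1:, :].expand(-1, captions.size(1), -1)
        words = self.word_emb(captions)
        dec_out, _ = self.dec_lstm(torch.cat([words, ctx], dim=-1), state)
        return self.out(dec_out)                                  # (B, L, vocab)

# Smoke test with random tensors; checks shapes only.
model = MultimodalS2VT()
logits = model(torch.randn(2, 30, 2048), torch.randn(2, 30, 39),
               torch.randint(0, 20, (2,)), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])

In practice the output logits would be trained with cross-entropy against the reference captions; the smoke test above only verifies that the tensors flow through the fused encoder-decoder.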

References

  1. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv preprint arXiv:1603.04467, 2016.
  2. D. Bahdanau, K. Cho, and Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations, 2015.
  3. S. Banerjee and A. Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Association for Computational Linguistics Workshop, 2005.
  4. F. Beritelli and R. Grasso. A Pattern Recognition System for Environmental Sound Classification Based on MFCCs and Neural Networks. In IEEE International Conference on Signal Processing and Communication Systems, pages 1--4, 2008.
  5. P. Das, C. Xu, R. F. Doell, and J. J. Corso. A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
  6. A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every Picture Tells a Story: Generating Sentences from Images. In European Conference on Computer Vision, 2010.
  7. T. Giannakopoulos. pyAudioAnalysis: An Open-Source Python Library for Audio Signal Analysis. PLoS ONE, 10(12):1--17, 2015.
  8. S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition. In IEEE International Conference on Computer Vision, 2013.
  9. K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  10. G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, 29(6):82--97, 2012.
  11. S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735--1780, 1997.
  12. A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-Scale Video Classification with Convolutional Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  13. N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama. Generating Natural-Language Video Descriptions Using Text-Mined Knowledge. In AAAI Conference on Artificial Intelligence, 2013.
  14. G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby Talk: Understanding and Generating Simple Image Descriptions. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
  15. S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing Simple Image Descriptions Using Web-Scale N-grams. In Conference on Computational Natural Language Learning, 2011.
  16. C.-Y. Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Association for Computational Linguistics Workshop, volume 8, 2004.
  17. B. Logan. Mel Frequency Cepstral Coefficients for Music Modeling. In International Symposium on Music Information Retrieval, 2000.
  18. K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A Method for Automatic Evaluation of Machine Translation. In Association for Computational Linguistics, pages 311--318, 2002.
  19. J. Pennington, R. Socher, and C. D. Manning. GloVe: Global Vectors for Word Representation. In Conference on Empirical Methods in Natural Language Processing, 2014.
  20. A. Rohrbach, M. Rohrbach, and B. Schiele. The Long-Short Story of Movie Description. In German Conference on Pattern Recognition, 2015.
  21. I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems, 2014.
  22. J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney. Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild. In International Conference on Computational Linguistics, 2014.
  23. D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. In IEEE International Conference on Computer Vision, 2015.
  24. L. van der Maaten and G. Hinton. Visualizing Data Using t-SNE. Journal of Machine Learning Research, 9:2579--2605, 2008.
  25. R. Vedantam, L. C. Zitnick, and D. Parikh. CIDEr: Consensus-Based Image Description Evaluation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  26. S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to Sequence - Video to Text. In IEEE International Conference on Computer Vision, 2015.
  27. J. Xu, T. Mei, T. Yao, and Y. Rui. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  28. B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image Parsing to Text Description. Proceedings of the IEEE, 98(8):1485--1508, 2010.

Published in

MM '16: Proceedings of the 24th ACM International Conference on Multimedia
October 2016, 1542 pages
ISBN: 9781450336031
DOI: 10.1145/2964284
Copyright © 2016 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


          Acceptance Rates

MM '16 paper acceptance rate: 52 of 237 submissions (22%). Overall acceptance rate: 995 of 4,171 submissions (24%).

