ABSTRACT
Recurrent Neural Networks (RNNs), especially Long Short-Term Memory (LSTM) networks, are widely used for video captioning because they can model the temporal dependencies within both the video frames and the corresponding descriptions. However, as sequences grow longer, these dependencies become much harder to capture. Moreover, in a traditional LSTM, previously generated hidden states other than the last one play no direct role in predicting the current word, so the prediction may depend heavily on the most recent hidden state rather than on the overall context. To better capture long-range dependencies and directly exploit earlier hidden states, we propose a novel model named Attention-based Densely Connected Long Short-Term Memory (DenseLSTM). In DenseLSTM, every previous cell is connected to the current cell to ensure maximum information flow, so the update of the current state depends directly on all of its predecessors. Furthermore, an attention mechanism models the influence of the different hidden states. Because each cell is directly connected to all of its successors, it has direct access to the gradients from later cells, and long-range dependencies are therefore captured more effectively. Experiments on two public video captioning datasets, the Microsoft Video Description Corpus (MSVD) and MSR-VTT, demonstrate the effectiveness of DenseLSTM.
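The core idea — letting the current cell attend over all previously generated hidden states instead of seeing only the last one — can be sketched as follows. This is a minimal illustrative NumPy implementation, not the paper's actual architecture: the dot-product attention scoring, the single gate weight matrix, and all dimensions are assumptions for clarity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class DenseAttentionLSTMCell:
    """Sketch of an LSTM cell densely connected to all previous cells.

    At each step, an attention-weighted summary of ALL previous hidden
    states (not just the last one) is fed into the gate computation.
    """

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        d = input_size + hidden_size
        # One weight matrix producing the four LSTM gates (i, f, o, g).
        self.W = rng.standard_normal((4 * hidden_size, d)) * 0.1
        self.b = np.zeros(4 * hidden_size)

    def step(self, x, prev_hs, prev_c):
        # Dense connectivity: attend over every previous hidden state.
        H = np.stack(prev_hs)          # (t, hidden)
        scores = H @ H[-1]             # dot-product scores (an assumption)
        alpha = softmax(scores)        # attention weights over past states
        context = alpha @ H            # weighted summary of the whole history
        # Standard LSTM gating, driven by input + attended context.
        z = self.W @ np.concatenate([x, context]) + self.b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * prev_c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c
```

In a plain LSTM the `context` term would simply be `prev_hs[-1]`; replacing it with the attention-weighted summary is what gives each step direct access to the full history, and (in training) gives every earlier cell a direct gradient path from later ones.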