ABSTRACT
Recurrent Neural Networks (RNNs), especially Long Short-Term Memory (LSTM) networks, are widely used for video captioning because they can model the temporal dependencies within both the video frames and the corresponding descriptions. However, as sequences grow longer, these dependencies become much harder to capture. Moreover, in a traditional LSTM, previously generated hidden states other than the last one play no direct role in predicting the current word, so the prediction may depend heavily on the most recent hidden state rather than on the overall context. To better capture long-range dependencies and directly exploit earlier hidden states, we propose a novel model named Attention-based Densely Connected Long Short-Term Memory (DenseLSTM). In DenseLSTM, every previous cell is connected to the current cell to ensure maximum information flow, so the update of the current state depends directly on all of its predecessors. Furthermore, an attention mechanism models the influence of the different hidden states. Because each cell is directly connected to all of its successors, it has direct access to the gradients from later cells, and long-range dependencies are therefore captured more effectively. Experiments on two public video captioning datasets, the Microsoft Video Description Corpus (MSVD) and MSR-VTT, demonstrate the effectiveness of DenseLSTM.
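The core idea — letting the current cell attend over all previously generated hidden states instead of seeing only the last one — can be sketched as follows. This is a minimal illustrative NumPy implementation, not the paper's actual architecture: the dot-product attention scoring, the single gate weight matrix, and all dimensions are assumptions for clarity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class DenseAttentionLSTMCell:
    """Sketch of an LSTM cell densely connected to all previous cells.

    At each step, an attention-weighted summary of ALL previous hidden
    states (not just the last one) is fed into the gate computation.
    """

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        d = input_size + hidden_size
        # One weight matrix producing the four LSTM gates (i, f, o, g).
        self.W = rng.standard_normal((4 * hidden_size, d)) * 0.1
        self.b = np.zeros(4 * hidden_size)

    def step(self, x, prev_hs, prev_c):
        # Dense connectivity: attend over every previous hidden state.
        H = np.stack(prev_hs)          # (t, hidden)
        scores = H @ H[-1]             # dot-product scores (an assumption)
        alpha = softmax(scores)        # attention weights over past states
        context = alpha @ H            # weighted summary of the whole history
        # Standard LSTM gating, driven by input + attended context.
        z = self.W @ np.concatenate([x, context]) + self.b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * prev_c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c
```

In a plain LSTM the `context` term would simply be `prev_hs[-1]`; replacing it with the attention-weighted summary is what gives each step direct access to the full history, and (in training) gives every earlier cell a direct gradient path from later ones.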