DOI: 10.1145/3343031.3350932
Research Article

Attention-based Densely Connected LSTM for Video Captioning

Published: 15 October 2019

ABSTRACT

Recurrent Neural Networks (RNNs), especially the Long Short-Term Memory (LSTM), have been widely used for video captioning, since they can model the temporal dependencies within both the video frames and the corresponding descriptions. However, as the sequence grows longer, these dependencies become much harder to handle. Moreover, in a traditional LSTM, previously generated hidden states other than the last one do not directly contribute to predicting the current word, so the predicted word may depend heavily on the last hidden state rather than on the overall context. To better capture long-range dependencies and to directly leverage earlier hidden states, we propose a novel model named Attention-based Densely Connected Long Short-Term Memory (DenseLSTM). In DenseLSTM, to ensure maximum information flow, all previous cells are connected to the current cell, so that the update of the current state depends directly on all of its predecessors. Furthermore, an attention mechanism is designed to model the impact of the different hidden states. Because each cell is directly connected to all of its successors, it also has direct access to the gradients from later cells, which allows long-range dependencies to be captured more effectively. We perform experiments on two publicly available video captioning datasets, the Microsoft Video Description Corpus (MSVD) and MSR-VTT, and the results demonstrate the effectiveness of DenseLSTM.
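The mechanism described above can be illustrated with a short sketch: one decoder step in which the current cell attends over all previously generated hidden states and feeds the attention-weighted summary back into its update. This is a minimal sketch assuming PyTorch; the class and argument names (DenseAttnLSTMDecoder, input_dim, hidden_dim, history) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of an attention-based densely connected LSTM decoder step.
# Assumes PyTorch; all names are illustrative, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseAttnLSTMDecoder(nn.Module):
    """One decoding step that attends over ALL previously generated hidden
    states, so every earlier state contributes directly to the current update."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # The cell input is the current word/feature input concatenated with
        # the attention-weighted summary of all earlier hidden states.
        self.cell = nn.LSTMCell(input_dim + hidden_dim, hidden_dim)
        # Scores each past hidden state against the current one.
        self.attn = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x_t, h_prev, c_prev, history):
        # history: list of earlier hidden states [h_0, ..., h_{t-1}], each (B, hidden_dim)
        if history:
            H = torch.stack(history, dim=1)                    # (B, t, hidden_dim)
            query = h_prev.unsqueeze(1).expand_as(H)           # broadcast current state
            scores = self.attn(torch.cat([H, query], dim=-1))  # (B, t, 1)
            weights = F.softmax(scores, dim=1)                 # attention over past states
            context = (weights * H).sum(dim=1)                 # (B, hidden_dim)
        else:
            context = torch.zeros_like(h_prev)                 # no history at the first step
        # Dense connection: the summary of earlier states enters the cell input directly.
        h_t, c_t = self.cell(torch.cat([x_t, context], dim=-1), (h_prev, c_prev))
        return h_t, c_t
```

During caption generation, each new h_t would be appended to history, so that later steps can attend over every earlier state rather than relying on the last hidden state alone.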


Published in: MM '19: Proceedings of the 27th ACM International Conference on Multimedia, October 2019, 2794 pages. ISBN: 9781450368896. DOI: 10.1145/3343031. Copyright © 2019 ACM. Publisher: Association for Computing Machinery, New York, NY, United States.


Acceptance Rates: MM '19 paper acceptance rate 252 of 936 submissions (27%); overall acceptance rate 995 of 4,171 submissions (24%).
