Abstract
Recent research in convolutional and recurrent neural networks has fueled rapid advances in video understanding. We propose a video captioning framework with the performance and efficiency needed for deployment in distributed surveillance systems. Our method combines an efficient hierarchical architecture with novel attention mechanisms at both the local and global levels. By shifting focus across spatiotemporal locations, attention mechanisms correlate sequential outputs with activation maps, offering a flexible way to adaptively combine multiple frames and locations of a video. Because soft attention mixing weights are learned via back-propagation, the number of weights, and hence the number of input frames, must be known in advance. To remove this restriction, our framework employs continuous attention mechanisms defined over a family of Gaussian distributions. Our efficient multistream hierarchical model couples a recurrent architecture with a soft hierarchy layer built from both equally spaced and dynamically localized boundary cuts. In contrast to costly volumetric attention approaches, we use video attributes to steer temporal attention. The fully learnable, end-to-end approach predicts salient temporal regions of actions and objects in the video. We demonstrate state-of-the-art captioning results on the popular MSVD, MSR-VTT, and M-VAD datasets and compare several variants of the algorithm suited to real-time applications. By adjusting the input frame rate, we show that a single computer can generate effective captions for 100 simultaneous camera streams. We additionally study how bit-rate compression affects captioning results.
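
To make the variable-length temporal attention concrete, the sketch below shows one way attention weights over an arbitrary number of frames can be generated from a small bank of Gaussian kernels, so a fixed set of learned parameters applies to clips of any length. This is a minimal NumPy illustration under our own assumptions, not the authors' implementation; the names gaussian_attention, frame_features, means, log_stds, and mix_logits are hypothetical.

import numpy as np

def gaussian_attention(frame_features, means, log_stds, mix_logits):
    """Blend T per-frame descriptors into a single clip descriptor.

    frame_features : (T, D) array of per-frame CNN features
    means, log_stds: (K,) Gaussian centers and log-widths in normalized time [0, 1]
    mix_logits     : (K,) mixing scores over the K kernels
    """
    T = frame_features.shape[0]
    t = np.linspace(0.0, 1.0, T)                          # normalized frame positions
    stds = np.exp(log_stds)
    # (K, T) kernel responses: one row of temporal weights per Gaussian
    kernels = np.exp(-0.5 * ((t[None, :] - means[:, None]) / stds[:, None]) ** 2)
    kernels /= kernels.sum(axis=1, keepdims=True)         # each kernel sums to 1 over time
    mix = np.exp(mix_logits) / np.exp(mix_logits).sum()   # softmax over kernels
    weights = mix @ kernels                               # (T,) soft attention over frames
    return weights @ frame_features                       # (D,) attended clip descriptor

# Example: 40 frames of 2048-d features attended by 3 Gaussian kernels
feats = np.random.randn(40, 2048)
clip_vec = gaussian_attention(feats,
                              means=np.array([0.2, 0.5, 0.8]),
                              log_stds=np.log([0.05, 0.1, 0.05]),
                              mix_logits=np.zeros(3))
print(clip_vec.shape)  # (2048,)

In a full captioning model of the kind summarized above, the kernel means, widths, and mixing scores would be predicted at each decoding step and trained end-to-end by back-propagation; here they are fixed by hand purely for illustration.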
Cite this article
Sah, S., Nguyen, T. & Ptucha, R. Understanding temporal structure for video captioning. Pattern Anal Applic 23, 147–159 (2020). https://doi.org/10.1007/s10044-018-00770-3