Abstract
Video description is the task of translating video into natural language. Many recent effective models for this task are built on popular deep convolutional neural networks and recurrent neural networks. However, most popular methods overlook the abstractness and representational ability of the visual motion feature and the language feature. In this work, a framework based on double-channel language feature mining is proposed, in which a deep transformation layer (DTL) is employed in both the motion feature extraction stage and the language modeling stage, increasing the number of feature transformations and enhancing the representational and generalization power of the features. In addition, an early deep sequential fusion strategy is introduced into the model, fusing features by element-wise product. Moreover, to capture more comprehensive information, a late deep sequential fusion strategy is also employed: the output probabilities from the modules with and without DTL are fused by weighted averaging, further improving the accuracy and semantics of the generated sentences. Multiple experiments and an ablation study are conducted on two public datasets, Youtube2Text and MSR-VTT2016, and the results are competitive with other popular methods. In particular, the CIDEr score reaches 82.5 and 45.9 on the two datasets respectively, demonstrating the effectiveness of the proposed model.
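To make these mechanisms concrete, the following Python (PyTorch) sketch illustrates the three ideas at toy scale. It is a minimal illustration under stated assumptions, not the paper's implementation: the module name DeepTransformationLayer, the tanh activation, the feature size 512, the vocabulary size 10000, and the fusion weight 0.5 are all placeholders chosen for demonstration.

import torch
import torch.nn as nn

class DeepTransformationLayer(nn.Module):
    """Stacked fully connected transformations that re-map a feature vector
    several times, deepening its representation (one reading of the DTL)."""
    def __init__(self, dim: int, num_transforms: int = 2):
        super().__init__()
        blocks = [nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
                  for _ in range(num_transforms)]
        self.layers = nn.Sequential(*blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)

# Early deep sequential fusion: element-wise product of the DTL-refined
# motion feature and the current language feature at a decoding step.
dtl = DeepTransformationLayer(dim=512)       # 512 is an assumed feature size
motion_feat = torch.randn(1, 512)            # pooled motion feature (placeholder)
language_feat = torch.randn(1, 512)          # word/LSTM hidden feature (placeholder)
fused = dtl(motion_feat) * language_feat     # element-wise product fusion

# Late deep sequential fusion: weighted average of the word distributions
# produced by the channel with DTL and the channel without it.
vocab_size = 10000                           # assumed vocabulary size
p_with_dtl = torch.softmax(torch.randn(1, vocab_size), dim=-1)
p_without_dtl = torch.softmax(torch.randn(1, vocab_size), dim=-1)
w = 0.5                                      # fusion weight, an assumption
p_final = w * p_with_dtl + (1 - w) * p_without_dtl

One convenience of the element-wise product is that the fused vector keeps the same dimensionality as its inputs, unlike concatenation, which suits a per-step sequential fusion of motion and language features.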
References
Aafaq N, Akhtar N, Liu W, Gilani SZ, Mian A (2019) Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In: IEEE conference on computer vision and pattern recognition, pp 12487–12496
Badrinarayanan V, Kendall A, Cipolla R (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 39(12):2481–2495
Ballas N, Yao L, Pal C, Courville A (2015) Delving deeper into convolutional networks for learning video representations. In: International conference on learning representations
Banerjee S, Lavie A (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Annual meeting of the association for computational linguistics workshop, pp 65–72
Baraldi L, Grana C, Cucchiara R (2017) Hierarchical boundary-aware neural encoder for video captioning. In: IEEE conference on computer vision and pattern recognition, pp 3185–3194
Bin Y, Yang Y, Shen F, Xie N, Shen H, Li X (2019) Describing video with attention based bidirectional lstm. IEEE Trans Cybern 49(7):2631–2641
Chen DL, Dolan WB (2011) Collecting highly parallel data for paraphrase evaluation. In: The 49th annual meeting of the association for computational linguistics, pp 190–200
Chen H, Ding G, Lin Z, Zhao S, Han J (2018) Show, observe and tell: attribute-driven attention model for image captioning. In: International joint conference on artificial intelligence, pp 606–612
Chen H, Ding G, Zhao S, Han J (2018) Temporal-difference learning with sampling baseline for image captioning. In: The AAAI conference on artificial intelligence, pp 6706–6713
Chen Y, Wang S, Zhang W, Huang Q (2018) Less is more: picking informative frames for video captioning. In: The European conference on computer vision, pp 367–384
Chen J, Pan Y, Li Y, Yao T, Chao H, Mei T (2019) Temporal deformable convolutional encoder-decoder networks for video captioning. In: The AAAI conference on artificial intelligence, pp 8167–8174
Cho K, Courville A, Bengio Y (2015) Describing multimedia content using attention-based encoder-decoder networks. IEEE Trans Multimed 17(11):1875–1886
Ding G, Chen M, Zhao S, Chen H, Han J, Liu Q (2019) Neural image caption generation with weighted training and reference. Cogn Comput 11(6):763–777
Donahue J, Hendricks LA, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In: IEEE conference on computer vision and pattern recognition, pp 2625–2634
Dong J, Li X, Lan W, Huo Y, Snoek CG (2016) Early embedding and late reranking for video captioning. In: ACM conference on multimedia conference, pp 1082–1086
Gao L, Guo Z, Zhang H, Xu X, Shen H (2017) Video captioning with attention-based lstm and semantic consistency. IEEE Trans Multimed 19(9):2045–2055
Guadarrama S, Krishnamoorthy N, Malkarnenkar G, Venugopalan S, Mooney R, Darrell T, Saenko K (2013) Youtube2text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: IEEE international conference on computer vision, pp 2712–2719
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition, pp 770–778
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Iashin V, Rahtu E (2020) A better use of audio-visual cues: dense video captioning with bi-modal transformer. arXiv:2005.08271v1
Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, et al. (2014) Caffe: convolutional architecture for fast feature embedding. In: ACM conference on multimedia, pp 675–678
Jin Q, Chen J, Chen S, Xiong Y, Hauptmann A (2016) Describing videos using multimodal fusion. In: ACM conference on multimedia conference, pp 1087–1091
Krishnamoorthy N, Malkarnenkar G, Mooney RJ, Saenko K, Guadarrama S (2013) Generating natural-language video descriptions using text-mined knowledge. In: The AAAI conference on artificial intelligence, pp 541–547
Li W, Guo D, Fang X (2018) Multimodal architecture for video captioning with memory networks and an attention mechanism. Pattern Recognit Lett 105:23–29
Lin CY, Och FJ (2004) Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In: Annual meeting of the association for computational linguistics, pp 21–26
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: common objects in context. In: European conference on computer vision, pp 740–755
Pan P, Xu Z, Yang Y, Wu F, Zhuang Y (2016) Hierarchical recurrent neural encoder for video representation with application to captioning. In: IEEE conference on computer vision and pattern recognition, pp 1029–1038
Pan Y, Mei T, Yao T, Li H, Rui Y (2016) Jointly modeling embedding and translation to bridge video and language. In: IEEE conference on computer vision and pattern recognition, pp 4594–4602
Pan Y, Yao T, Li H, Mei T (2017) Video captioning with transferred semantic attributes. In: IEEE conference on computer vision and pattern recognition, pp 984–992
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Annual meeting of the association for computational linguistics, pp 311–318
Pu Y, Min MR, Gan Z, Carin L (2016) Adaptive feature abstraction for translating video to language. arXiv:1611.07837
Quan Q, He F, Li H (2020) A multi-phase blending method with incremental intensity for training detection networks. Vis Comput, in press. https://doi.org/10.1007/s00371-020-01796-7
Ramanishka V, Das A, Park DH, Venugopalan S, Hendricks LA, Rohrbach M, Saenko K (2016) Multimodal video description. In: ACM conference on multimedia conference, pp 1092–1096
Ren S, He K, Girshick R, Sun J (2016) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, et al. (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252
Shen Z, Li J, Su Z, Li M, Chen Y, Jiang Y-G, Xue X (2017) Weakly supervised dense video captioning. In: IEEE conference on computer vision and pattern recognition, pp 5159–5167
Shetty R, Laaksonen J (2016) Frame- and segment-level features and candidate pool evaluation for video caption generation. In: ACM conference on multimedia conference, pp 1073–1076
Song J, Guo Y, Gao L, Li X, Hanjalic A, Shen H (2019) From deterministic to generative: multimodal stochastic rnns for video captioning. IEEE Trans Neural Netw Learn Syst 30(10):3047–3058
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: IEEE conference on computer vision and pattern recognition, pp 1–9
Tang P, Wang H, Kwong S (2017) G-MS2F: GoogLeNet based multi-stage feature fusion of deep CNN for scene recognition. Neurocomputing 225:188–197
Tang P, Wang H, Kwong S (2018) Deep sequential fusion lstm network for image description. Neurocomputing 312:154–164
Tang P, Wang H, Li Q (2019) Rich visual and language representation with complementary semantics for video captioning. ACM Trans Multimed Comput Commun Appl 15(2), Article 31, pp 1–23
Thomason J, Venugopalan S, Guadarrama S, Saenko K, Mooney R (2014) Integrating language and vision to generate natural language descriptions of videos in the wild. In: Proceedings of international conference on computational linguistics, pp 1218–1227
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: The conference on neural information processing systems, pp 5998–6008
Vedantam R, Zitnick CL, Parikh D (2015) CIDEr: consensus-based image description evaluation. In: IEEE conference on computer vision and pattern recognition, pp 4566–4575
Venugopalan S, Rohrbach M, Donahue J, Mooney R, Darrell T, Saenko K (2015) Sequence to sequence – video to text. In: IEEE international conference on computer vision, pp 4534–4542
Venugopalan S, Xu H, Donahue J, Rohrbach M, Mooney R, Saenko K (2015) Translating videos to natural language using deep recurrent neural networks. In: The 2015 annual conference of the North American chapter of the ACL, pp 1494–1504
Venugopalan S, Hendricks LA, Mooney R, Saenko K (2016) Improving lstm-based video description with linguistic knowledge mined from text. In: Conference on empirical methods in natural language processing, pp 1961–1966
Wang B, Ma L, Zhang W, Liu W (2018) Reconstruction network for video captioning. In: IEEE conference on computer vision and pattern recognition, pp 7622–7631
Wang J, Wang W, Huang Y, Wang L, Tan T (2018) M3: multimodal memory modelling for video captioning. In: IEEE conference on computer vision and pattern recognition, pp 7512–7520
Wei S, Zhao Y, Zhu Z, Nan L (2010) Multimodal fusion for video search reranking. IEEE Trans Knowl Data Eng 22(8):1191–1199
Wu Q, Shen C, Liu L, Dick A, Hengel A (2016) What value do explicit high level concepts have in vision to language problems? In: IEEE conference on computer vision and pattern recognition, pp 203–212
Wu Y, He F, Zhang D, Li X (2018) Service-oriented feature-based data exchange for cloud-based design and manufacturing. IEEE Trans Serv Comput 11(2):341–353
Xu H, Venugopalan S, Ramanishka V, Rohrbach M, Saenko K (2015) A multi-scale multiple instance video description network. arXiv:1505.05914
Xu J, Mei T, Yao T, Rui Y (2016) MSR-VTT: a large video description dataset for bridging video and language. In: IEEE conference on computer vision and pattern recognition, pp 5288–5296
Yang Y, Zhou J, Ai J, Bin Y, Hanjalic A, Shen H (2018) Video captioning by adversarial lstm. IEEE Trans Image Process 27(11):5600–5611
Yao L, Torabi A, Cho K, Ballas N, Pal C, Larochelle H, Courville A (2015) Describing videos by exploiting temporal structure. In: IEEE international conference on computer vision, pp 4507–4515
Yu H, Wang J, Huang Z, Yang Y, Xu W (2016) Video paragraph captioning using hierarchical recurrent neural networks. In: IEEE conference on computer vision and pattern recognition, pp 4584–4593
Yu H, He F, Pan Y (2020) A scalable region-based level set method using adaptive bilateral filter for noisy image segmentation. Multimed Tools Appl 79:5743–5765
Zhang J, He F, Chen Y (2020) A new haze removal approach for sky/river alike scenes based on external and internal clues. Multimed Tools Appl 79:2085–2107
Additional information
This work was supported by the Research Foundation of Art Planning of Jiangxi Province (No. YG2017283), the Bidding Project for the Foundation of Colleges Key Research on Humanities and Social Science of Jiangxi Province (No. JD17082), the Doctoral Scientific Research Foundation of Jinggangshan University (No. JZB1923, JZB1807), and the National Natural Science Foundation of P. R. China (No. 61762052).
Cite this article
Tang, P., Xia, J., Tan, Y. et al. Double-channel language feature mining based model for video description. Multimed Tools Appl 79, 33193–33213 (2020). https://doi.org/10.1007/s11042-020-09674-z