
A Multi-instance Multi-label Dual Learning Approach for Video Captioning

Published: 14 June 2021

Abstract

Video captioning is a challenging task in multimedia processing that aims to generate informative natural language descriptions (captions) of video content. Previous video captioning approaches mainly focused on capturing the visual information in videos, using an encoder-decoder structure to generate captions. Recently, an encoder-decoder-reconstructor structure was proposed for video captioning, which captures the information in both videos and captions. Building on this structure, this article proposes a novel multi-instance multi-label dual learning approach (MIMLDL) to generate video captions. Specifically, MIMLDL contains two modules: a caption generation module and a video reconstruction module. The caption generation module uses a lexical fully convolutional neural network (Lexical FCN) with a weakly supervised multi-instance multi-label learning mechanism to learn a translatable mapping between video regions and lexical labels, from which video captions are generated. The video reconstruction module then synthesizes visual sequences to reproduce the raw videos from the outputs of the caption generation module. A dual learning mechanism fine-tunes the two modules according to the gap between the raw and the reproduced videos. Our approach can thus minimize the semantic gap between raw videos and the generated captions by minimizing the differences between the reproduced and the raw visual sequences. Experimental results on a benchmark dataset demonstrate that MIMLDL improves the accuracy of video captioning.
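The encoder-decoder-reconstructor idea described above amounts to a joint objective: a captioning loss on the generated words plus a reconstruction loss measuring how well the caption-side representation reproduces the input video features. The sketch below is a minimal, generic PyTorch illustration of that training loop, not the authors' MIMLDL implementation: it replaces the Lexical FCN and the multi-instance multi-label mechanism with a plain GRU encoder-decoder, and all module names, dimensions, and the weighting factor `lam` are hypothetical.

```python
import torch
import torch.nn as nn

class CaptionGenerator(nn.Module):
    """Encoder-decoder: maps frame features to per-word logits (toy stand-in for the Lexical FCN)."""
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T_frames, feat_dim); captions: (B, T_words) token ids
        _, h = self.encoder(frame_feats)                 # video summary state
        dec_out, _ = self.decoder(self.embed(captions), h)
        return self.out(dec_out), dec_out                # word logits, decoder hidden states

class VideoReconstructor(nn.Module):
    """Reconstructor: rebuilds coarse frame features from the decoder's hidden states."""
    def __init__(self, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, dec_states, num_frames):
        out, _ = self.rnn(dec_states)
        summary = out.mean(dim=1, keepdim=True)          # caption-level summary
        return self.proj(summary).repeat(1, num_frames, 1)  # coarse per-frame reconstruction

# Joint (dual) objective: captioning loss + reconstruction loss on toy data.
gen, rec = CaptionGenerator(), VideoReconstructor()
optimizer = torch.optim.Adam(list(gen.parameters()) + list(rec.parameters()), lr=1e-4)
ce, mse, lam = nn.CrossEntropyLoss(), nn.MSELoss(), 0.2

frame_feats = torch.randn(4, 30, 2048)        # toy batch: 4 clips x 30 frame features
captions = torch.randint(1, 10000, (4, 12))   # toy captions: 12 token ids each

logits, dec_states = gen(frame_feats, captions[:, :-1])        # teacher forcing
recon = rec(dec_states, num_frames=frame_feats.size(1))
loss = ce(logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1)) \
       + lam * mse(recon, frame_feats)                          # reconstruction gap
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In this toy setup the reconstruction term penalizes captions whose decoder states cannot reproduce the video features, which is the intuition behind closing the semantic gap; the paper's actual dual learning mechanism and MIML region-to-word mapping are more elaborate than this sketch.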


Published In

ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 17, Issue 2s
June 2021
349 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3465440

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 June 2021
Accepted: 01 January 2021
Revised: 01 December 2020
Received: 01 July 2020
Published in TOMM Volume 17, Issue 2s


Author Tags

  1. Deep neural networks
  2. Dual learning
  3. Multiple instance learning
  4. Multimedia processing
  5. Video captioning

Qualifiers

  • Research-article
  • Refereed


