
A Multi-instance Multi-label Dual Learning Approach for Video Captioning

Published: 14 June 2021

Abstract

Video captioning is a challenging task in multimedia processing that aims to generate informative natural language descriptions (captions) of video content. Previous video captioning approaches mainly focused on capturing the visual information in videos, using an encoder-decoder structure to generate captions. Recently, an encoder-decoder-reconstructor structure was proposed for video captioning, which captures the information in both videos and captions. Building on this structure, this article proposes a novel multi-instance multi-label dual learning approach (MIMLDL) to generate video captions. Specifically, MIMLDL contains two modules: a caption generation module and a video reconstruction module. The caption generation module uses a lexical fully convolutional neural network (Lexical FCN) with a weakly supervised multi-instance multi-label learning mechanism to learn a translatable mapping between video regions and lexical labels, from which video captions are generated. The video reconstruction module then synthesizes visual sequences to reproduce the raw videos from the outputs of the caption generation module. A dual learning mechanism fine-tunes the two modules according to the gap between the raw and the reproduced videos. Our approach can thus minimize the semantic gap between raw videos and the generated captions by minimizing the differences between the reproduced and the raw visual sequences. Experimental results on a benchmark dataset demonstrate that MIMLDL improves the accuracy of video captioning.
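The encoder-decoder-reconstructor idea described above amounts to a joint objective: a captioning loss on the generated words plus a reconstruction loss measuring how well the caption-side representation reproduces the input video features. The sketch below is a minimal, generic PyTorch illustration of that training loop, not the authors' MIMLDL implementation: it replaces the Lexical FCN and the multi-instance multi-label mechanism with a plain GRU encoder-decoder, and all module names, dimensions, and the weighting factor `lam` are hypothetical.

```python
import torch
import torch.nn as nn

class CaptionGenerator(nn.Module):
    """Encoder-decoder: maps frame features to per-word logits (toy stand-in for the Lexical FCN)."""
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T_frames, feat_dim); captions: (B, T_words) token ids
        _, h = self.encoder(frame_feats)                 # video summary state
        dec_out, _ = self.decoder(self.embed(captions), h)
        return self.out(dec_out), dec_out                # word logits, decoder hidden states

class VideoReconstructor(nn.Module):
    """Reconstructor: rebuilds coarse frame features from the decoder's hidden states."""
    def __init__(self, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, dec_states, num_frames):
        out, _ = self.rnn(dec_states)
        summary = out.mean(dim=1, keepdim=True)          # caption-level summary
        return self.proj(summary).repeat(1, num_frames, 1)  # coarse per-frame reconstruction

# Joint (dual) objective: captioning loss + reconstruction loss on toy data.
gen, rec = CaptionGenerator(), VideoReconstructor()
optimizer = torch.optim.Adam(list(gen.parameters()) + list(rec.parameters()), lr=1e-4)
ce, mse, lam = nn.CrossEntropyLoss(), nn.MSELoss(), 0.2

frame_feats = torch.randn(4, 30, 2048)        # toy batch: 4 clips x 30 frame features
captions = torch.randint(1, 10000, (4, 12))   # toy captions: 12 token ids each

logits, dec_states = gen(frame_feats, captions[:, :-1])        # teacher forcing
recon = rec(dec_states, num_frames=frame_feats.size(1))
loss = ce(logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1)) \
       + lam * mse(recon, frame_feats)                          # reconstruction gap
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In this toy setup the reconstruction term penalizes captions whose decoder states cannot reproduce the video features, which is the intuition behind closing the semantic gap; the paper's actual dual learning mechanism and MIML region-to-word mapping are more elaborate than this sketch.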


Published In

ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 17, Issue 2s
June 2021
349 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3465440

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 June 2021
Accepted: 01 January 2021
Revised: 01 December 2020
Received: 01 July 2020
Published in TOMM Volume 17, Issue 2s


Author Tags

  1. Deep neural networks
  2. Dual learning
  3. Multiple instance learning
  4. Multimedia processing
  5. Video captioning

Qualifiers

  • Research-article
  • Refereed


