ABSTRACT
Pioneering efforts have been devoted to content-oriented video captioning, which generates sentences describing the visual content of a given video from the producer's perspective. By contrast, this work targets search-oriented captioning, which summarizes a given video by generating query-like sentences from the consumer's perspective. Beyond relevance, diversity is vital for characterizing consumers' search intentions from different aspects. Towards this end, we devise a large-scale multimodal pre-training network regularized by five tasks to strengthen the downstream video representation; it is trained over our collected corpus of 11M micro-videos. Thereafter, we present a flow-based diverse captioning model that generates different captions reflecting consumers' search demands. This model is optimized via a reconstruction loss and a KL divergence between the prior and the posterior. We evaluate our model on our constructed golden dataset of 690k <query, micro-video> pairs, and experimental results demonstrate its superiority.
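The training objective mentioned above, a reconstruction loss plus a KL divergence between the prior and the posterior, can be sketched as follows. This is a minimal illustration assuming diagonal-Gaussian prior and posterior over the latent code; the function names and the closed-form KL are assumptions for exposition, not the paper's actual implementation (which uses a flow-based model).

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    # Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal
    # Gaussians, summed over the latent dimensions.
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def caption_loss(recon_nll, mu_q, logvar_q, mu_p, logvar_p, beta=1.0):
    # Total objective: caption reconstruction negative log-likelihood
    # plus a beta-weighted KL regularizer pulling the (video, caption)
    # posterior toward the video-conditioned prior.
    return recon_nll + beta * kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
```

When the posterior matches the prior exactly, the KL term vanishes and only the reconstruction term remains; at generation time, sampling different latent codes from the prior yields the diverse captions.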
Index Terms
- Search-oriented Micro-video Captioning