ABSTRACT
Pioneering efforts have been devoted to content-oriented video captioning, which generates sentences describing the visual content of a given video from the producer's perspective. By contrast, this work targets search-oriented captioning, which summarizes a given video by generating query-like sentences from the consumer's perspective. Beyond relevance, diversity is vital for characterizing consumers' search intentions from different aspects. Towards this end, we devise a large-scale multimodal pre-training network regularized by five tasks to strengthen the downstream video representation; it is trained over our collected corpus of 11M micro-videos. Thereafter, we present a flow-based diverse captioning model that generates different captions reflecting consumers' search demands. This model is optimized via a reconstruction loss and a KL divergence between the prior and the posterior. We evaluate our model on our constructed golden dataset of 690k <query, micro-video> pairs, and experimental results demonstrate its superiority.
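The training objective mentioned above, a reconstruction loss plus a KL divergence between the prior and the posterior, can be sketched as follows. This is a minimal illustration assuming diagonal-Gaussian prior and posterior over the latent code; the function names and the closed-form KL are assumptions for exposition, not the paper's actual implementation (which uses a flow-based model).

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    # Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal
    # Gaussians, summed over the latent dimensions.
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def caption_loss(recon_nll, mu_q, logvar_q, mu_p, logvar_p, beta=1.0):
    # Total objective: caption reconstruction negative log-likelihood
    # plus a beta-weighted KL regularizer pulling the (video, caption)
    # posterior toward the video-conditioned prior.
    return recon_nll + beta * kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
```

When the posterior matches the prior exactly, the KL term vanishes and only the reconstruction term remains; at generation time, sampling different latent codes from the prior yields the diverse captions.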
Index Terms
- Search-oriented Micro-video Captioning