DOI: 10.1145/3503161.3548180
research-article

Search-oriented Micro-video Captioning

Published: 10 October 2022

ABSTRACT

Pioneering efforts have been dedicated to content-oriented video captioning, which generates relevant sentences that describe the visual content of a given video from the producer's perspective. By contrast, this work targets search-oriented captioning, which summarizes the given video by generating query-like sentences from the consumer's perspective. Beyond relevance, diversity is vital for characterizing consumers' search intentions from different aspects. Towards this end, we devise a large-scale multimodal pre-training network regularized by five tasks to strengthen the downstream video representation, trained over our collected 11M micro-videos. Thereafter, we present a flow-based diverse captioning model that generates different captions reflecting consumers' search demands. This model is optimized via a reconstruction loss and a KL divergence between the prior and the posterior. We evaluate our model on our constructed golden dataset comprising 690k <query, micro-video> pairs, and the experimental results demonstrate its superiority.
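The objective named in the abstract, a reconstruction loss plus a KL divergence between a prior conditioned on the video and a posterior conditioned on the video and the caption, follows the usual recipe for conditional latent-variable captioners. The sketch below is a minimal illustration of that objective only, not the authors' implementation: DiverseCaptioner, prior_net, posterior_net, and all dimensions are hypothetical, plain diagonal Gaussians stand in for the paper's flow-based prior/posterior, and the pre-trained multimodal representation and real caption decoder are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiverseCaptioner(nn.Module):
    """Toy conditional latent-variable captioner (hypothetical names and shapes).

    The prior p(z | video) and posterior q(z | video, caption) are diagonal
    Gaussians here; a flow-based model would additionally pass z through
    invertible transformations before decoding.
    """

    def __init__(self, video_dim=512, text_dim=512, latent_dim=64, vocab_size=10000):
        super().__init__()
        self.prior_net = nn.Linear(video_dim, 2 * latent_dim)                 # -> (mu_p, logvar_p)
        self.posterior_net = nn.Linear(video_dim + text_dim, 2 * latent_dim)  # -> (mu_q, logvar_q)
        self.decoder = nn.Sequential(                                          # stand-in caption decoder
            nn.Linear(video_dim + latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, vocab_size),                                        # vocabulary logits
        )

    @staticmethod
    def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
        # KL( q(z | video, caption) || p(z | video) ) for diagonal Gaussians.
        return 0.5 * torch.sum(
            logvar_p - logvar_q
            + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
            - 1.0,
            dim=-1,
        ).mean()

    def forward(self, video_feat, caption_feat, target_tokens):
        mu_p, logvar_p = self.prior_net(video_feat).chunk(2, dim=-1)
        mu_q, logvar_q = self.posterior_net(
            torch.cat([video_feat, caption_feat], dim=-1)
        ).chunk(2, dim=-1)

        # Reparameterized sample from the posterior during training.
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()

        # Greatly simplified decoding: one vocabulary distribution per example.
        logits = self.decoder(torch.cat([video_feat, z], dim=-1))
        recon = F.cross_entropy(logits, target_tokens)

        kl = self.gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
        return recon + kl  # the two terms the abstract names

# Usage with random tensors (batch of 4 hypothetical <query, micro-video> pairs):
model = DiverseCaptioner()
loss = model(torch.randn(4, 512), torch.randn(4, 512), torch.randint(0, 10000, (4,)))
loss.backward()
```

At inference time, one would draw several z from the prior p(z | video) and decode each into a different query-like caption; this sampling is what yields the diversity the abstract emphasizes.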



Published in

        MM '22: Proceedings of the 30th ACM International Conference on Multimedia
        October 2022
        7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161

        Copyright © 2022 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Qualifiers

        • research-article

        Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%

