MEME: Multi-Encoder Multi-Expert Framework with Data Augmentation for Video Retrieval

ABSTRACT
Text-to-video (T2V) retrieval aims to find videos relevant to a text query. Contrastive Language-Image Pretraining (CLIP), a vision-language model pretrained on large-scale image-caption pairs, has been studied extensively for this task. Existing T2V studies transfer CLIP's knowledge and focus on improving retrieval through fine-grained representation learning. While fine-grained contrast has achieved remarkable results, coarse-grained contrast has received far less attention. To this end, we propose Graph Patch Spreading (GPS), a method that aggregates patches across frames at the coarse-grained level. We apply GPS within our proposed Multi-Encoder Multi-Expert (MEME) framework. The scheme is general enough to be applied to any existing CLIP-based video-text retrieval model. We demonstrate the effectiveness of our method on existing models over the benchmark datasets MSR-VTT, MSVD, and LSMDC. Our code can be found at https://github.com/kang7734/MEME__.
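The abstract does not specify how GPS spreads patch information across frames, so the following is only an illustrative sketch of the general idea: patch tokens from all frames are treated as nodes of a k-nearest-neighbour cosine-similarity graph, smoothed by rounds of mean message passing, and then mean-pooled into a single coarse video vector. The function name, the k-NN graph construction, and all parameters are hypothetical, not the paper's actual method.

```python
import numpy as np

def graph_patch_spreading(patches, top_k=3, steps=1):
    """Aggregate patch features across frames via a similarity graph (sketch).

    patches: array of shape (T, P, D) -- T frames, P patches per frame, D dims.
    Returns one (D,)-dimensional coarse video representation.
    """
    T, P, D = patches.shape
    nodes = patches.reshape(T * P, D)              # every patch becomes a graph node
    unit = nodes / (np.linalg.norm(nodes, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T                            # cosine similarity between nodes
    np.fill_diagonal(sim, -np.inf)                 # exclude self-matches from k-NN
    idx = np.argsort(sim, axis=1)[:, -top_k:]      # top-k neighbours per node
    adj = np.zeros_like(sim)
    adj[np.arange(T * P)[:, None], idx] = 1.0
    adj = np.maximum(adj, adj.T)                   # symmetrise the edge set
    adj += np.eye(T * P)                           # self-loops keep each node's own feature
    deg = adj.sum(axis=1, keepdims=True)
    for _ in range(steps):                         # spread (average) features over edges
        nodes = (adj @ nodes) / deg
    return nodes.mean(axis=0)                      # coarse-grained pooling
```

One design point this sketch illustrates: because the graph connects patches across frames, not just within a frame, each propagation step mixes temporally distant but visually similar patches before pooling, which is what distinguishes a coarse-grained aggregate from simple per-frame mean pooling.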