DOI: 10.1145/3477495.3531826

Cross-Probe BERT for Fast Cross-Modal Search

Published: 07 July 2022

Abstract

Owing to the effectiveness of cross-modal attention, text-vision BERT models have achieved excellent performance in text-image retrieval. Nevertheless, cross-modal attention in text-vision BERT models incurs a high computation cost in retrieval because it takes text-vision pairs as input and must be evaluated for every candidate pair. It is therefore normally impractical to deploy these models for large-scale cross-modal retrieval in real applications. To address this inefficiency of existing text-vision BERT models, in this work we develop a novel architecture, cross-probe BERT. It devises a small number of text and vision probes, and cross-modal attention is achieved efficiently through interactions between the text and vision probes. This design incurs only a lightweight computation cost while still effectively exploiting cross-modal attention. Systematic experiments on public benchmarks demonstrate the excellent effectiveness and efficiency of our cross-probe BERT.
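The full text is not reproduced on this page; to make the probe mechanism described in the abstract concrete, below is a minimal PyTorch-style sketch of probe-based cross-modal scoring. All module names, probe counts, layer sizes, and the scoring head are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossProbeScorer(nn.Module):
    """Illustrative sketch of probe-based cross-modal attention.

    Instead of attending over every (text token, vision region) pair,
    each modality is first summarized by a small set of learnable probe
    vectors; cross-modal attention then runs only between the two small
    probe sets. This is a sketch of the idea, not the paper's model.
    """

    def __init__(self, dim=768, n_text_probes=4, n_vision_probes=4, n_heads=8):
        super().__init__()
        # Learnable probe vectors for each modality (hypothetical sizes).
        self.text_probes = nn.Parameter(torch.randn(n_text_probes, dim) * 0.02)
        self.vision_probes = nn.Parameter(torch.randn(n_vision_probes, dim) * 0.02)
        # Probes attend over their own modality's token/region features.
        self.text_pool = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.vision_pool = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Lightweight cross-modal attention between the two probe sets.
        self.cross = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, text_feats, vision_feats):
        # text_feats:   (B, n_text, dim) token embeddings from a text encoder
        # vision_feats: (B, n_vision, dim) region features from a vision encoder
        B = text_feats.size(0)
        tp = self.text_probes.unsqueeze(0).expand(B, -1, -1)
        vp = self.vision_probes.unsqueeze(0).expand(B, -1, -1)
        # Each probe set summarizes its own modality.
        tp, _ = self.text_pool(tp, text_feats, text_feats)
        vp, _ = self.vision_pool(vp, vision_feats, vision_feats)
        # Cross-modal interaction involves only the few probes.
        fused, _ = self.cross(tp, vp, vp)
        return self.score(fused.mean(dim=1)).squeeze(-1)  # (B,) match scores


# Usage: score 2 text-image pairs with 32 tokens and 36 regions each.
scorer = CrossProbeScorer()
scores = scorer(torch.randn(2, 32, 768), torch.randn(2, 36, 768))
print(scores.shape)  # torch.Size([2])
```

With n text tokens and m vision regions, standard pairwise cross-modal attention couples every token with every region for each candidate pair; in the sketch above, each modality is first condensed into a handful of probes, so the cross-modal step touches only the small probe sets.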

Supplementary Material

MP4 File (sigir2022.mp4)





    Published In

    SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2022
    3569 pages
    ISBN:9781450387323
    DOI:10.1145/3477495


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. cross-modal bert
    2. cross-modal retrieval
    3. multimedia search

    Qualifiers

    • Short-paper

    Conference

    SIGIR '22

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Article Metrics

    • Downloads (last 12 months): 21
    • Downloads (last 6 weeks): 2
    Reflects downloads up to 28 Feb 2025

    Cited By

    • (2025) Soft Prompt-tuning with Self-Resource Verbalizer for short text streams. Engineering Applications of Artificial Intelligence, Vol. 139, Article 109589. DOI: 10.1016/j.engappai.2024.109589. Online publication date: Jan-2025.
    • (2024) RoarGraph: A Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search. Proceedings of the VLDB Endowment, 17(11), 2735-2749. DOI: 10.14778/3681954.3681959. Online publication date: 1-Jul-2024.
    • (2024) Efficient Image-Text Retrieval via Keyword-Guided Pre-Screening. IEEE Transactions on Circuits and Systems for Video Technology, 34(6), 5132-5145. DOI: 10.1109/TCSVT.2023.3339489. Online publication date: Jun-2024.
    • (2023) Multimodal Neural Databases. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2619-2628. DOI: 10.1145/3539618.3591930. Online publication date: 19-Jul-2023.
    • (2023) Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training. IEEE Transactions on Image Processing, 32, 3622-3633. DOI: 10.1109/TIP.2023.3286710. Online publication date: 1-Jan-2023.
    • (2022) U-BERT for Fast and Scalable Text-Image Retrieval. Proceedings of the 2022 ACM SIGIR International Conference on Theory of Information Retrieval, 193-203. DOI: 10.1145/3539813.3545148. Online publication date: 23-Aug-2022.
    • (2022) Texture BERT for Cross-modal Texture Image Retrieval. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 4610-4614. DOI: 10.1145/3511808.3557710. Online publication date: 17-Oct-2022.
    • (2022) Multi-scale Multi-modal Dictionary BERT For Effective Text-image Retrieval in Multimedia Advertising. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 4655-4660. DOI: 10.1145/3511808.3557653. Online publication date: 17-Oct-2022.
