
Most and Least Retrievable Images in Visual-Language Query Systems

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13697)


Abstract

This is the first work to introduce the concepts of the Most Retrievable Image (MRI) and the Least Retrievable Image (LRI) in modern text-to-image retrieval systems. An MRI is associated with, and thus can be retrieved by, many unrelated texts, while an LRI is disassociated from, and thus not retrievable by, related texts. Both have important practical applications and implications. Due to their one-to-many nature, constructing MRIs and LRIs is fundamentally challenging. This work addresses the problem by developing novel and effective loss functions that craft perturbations to corrupt the feature correlation between the visual and language spaces, thereby enabling MRIs and LRIs. The proposed schemes are implemented on top of CLIP, a state-of-the-art image and text representation model, to demonstrate MRIs and LRIs and their applications in privacy-preserving image sharing and malicious advertising. They are evaluated through extensive experiments with modern visual-language models on multiple benchmarks, including Paris, ImageNet, Flickr30k, and MSCOCO. The results show the effectiveness and robustness of the proposed schemes for constructing MRIs and LRIs.
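
To make the one-to-many construction concrete, below is a minimal sketch of an MRI-style attack against CLIP. It is not the paper's proposed loss: it simply runs standard projected sign-gradient ascent (PGD) to maximize the mean cosine similarity between one image and many unrelated captions under an \(\ell_\infty\) budget of \(\varepsilon = 16/255\) (the value given in the paper's notes). The function name `craft_mri`, the step size `alpha`, and the step count are illustrative assumptions.

```python
# Illustrative PGD baseline, NOT the paper's loss: pull one image's CLIP
# embedding toward many unrelated captions at once (the MRI objective).
# Requires OpenAI's CLIP: pip install git+https://github.com/openai/CLIP.git
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def craft_mri(image, unrelated_texts, eps=16 / 255, alpha=2 / 255, steps=40):
    """image: CLIP-preprocessed tensor of shape (1, 3, 224, 224).
    unrelated_texts: captions the perturbed image should wrongly match."""
    tokens = clip.tokenize(unrelated_texts).to(device)
    with torch.no_grad():
        txt = model.encode_text(tokens)
        txt = txt / txt.norm(dim=-1, keepdim=True)  # unit-norm text embeddings
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        img = model.encode_image(image + delta)
        img = img / img.norm(dim=-1, keepdim=True)
        # One-to-many objective: mean cosine similarity to ALL unrelated texts.
        loss = (img @ txt.t()).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # sign-gradient ascent step
            delta.clamp_(-eps, eps)             # project onto the L_inf ball
            delta.grad.zero_()
    # For brevity the budget is applied in CLIP's normalized input space;
    # a faithful implementation would perturb raw pixels in [0, 1] instead.
    return (image + delta).detach()
```

An LRI would invert the objective, minimizing similarity to the image's own related captions; in both cases the perturbation degrades the image-text feature correlation rather than any single pairing.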


Notes

  1. Here, we set \(\varepsilon = 16/255\), a perturbation budget commonly used in robustness analyses of image classification systems.
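
For reference, this budget corresponds to the standard \(\ell_\infty\) threat model. With the usual PGD update (assumed here; the paper's exact update rule is not quoted in this note), the constraint and iteration read

\[
\|x_{\mathrm{adv}} - x\|_{\infty} \le \varepsilon = \tfrac{16}{255},
\qquad
x^{(t+1)} = \Pi_{\varepsilon}\!\left(x^{(t)} + \alpha\,\operatorname{sign}\!\left(\nabla_{x}\,\mathcal{L}\bigl(x^{(t)}\bigr)\right)\right),
\]

where \(\Pi_{\varepsilon}\) projects back onto the \(\ell_\infty\) ball of radius \(\varepsilon\) around the clean image \(x\), and \(\mathcal{L}\) is the retrieval objective being optimized.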


Acknowledgements

This work was supported in part by the NSF under Grants CNS-2120279, CNS-1950704, CNS-1828593, CNS-2153358, and OAC-1829771, the ONR under Grant N00014-20-1-2065, the AFRL under Grant FA8750-19-3-1000, the NSA under Grants H98230-21-1-0165 and H98230-21-1-0278, the DoD CoE-AIML under Contract W911NF-20-2-0277, the Commonwealth Cyber Initiative, and InterDigital Communications, Inc.

Author information

Correspondence to Hongyi Wu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 104 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhu, L., Ning, R., Li, J., Xin, C., Wu, H. (2022). Most and Least Retrievable Images in Visual-Language Query Systems. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13697. Springer, Cham. https://doi.org/10.1007/978-3-031-19836-6_1


  • DOI: https://doi.org/10.1007/978-3-031-19836-6_1


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19835-9

  • Online ISBN: 978-3-031-19836-6

  • eBook Packages: Computer Science, Computer Science (R0)
