Abstract
Referring perception, which aims at grounding visual objects with multimodal referring guidance, is essential for bridging the gap between humans, who provide instructions, and the environment where intelligent systems perceive. Despite progress in this field, the robustness of referring perception models (RPMs) against disruptive perturbations is not well explored. This work thoroughly assesses the resilience of RPMs against various perturbations in both general and specific contexts. Recognizing the complex nature of referring perception tasks, we present a comprehensive taxonomy of perturbations, and then develop a versatile toolbox for synthesizing and evaluating the effects of composite disturbances. Employing this toolbox, we construct R\(^2\)-Bench, a benchmark for assessing the Robustness of Referring perception models under noisy conditions across five key tasks. Moreover, we propose R\(^2\)-Agent, an LLM-based agent that simplifies and automates model evaluation via natural language instructions. Our investigation uncovers the vulnerabilities of current RPMs to various perturbations and provides tools for assessing model robustness, potentially promoting the safe and resilient integration of intelligent systems into complex real-world scenarios.
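The abstract's notion of "composite disturbances" can be made concrete with a small sketch: chain elementary perturbations into one composite corruption of a frame, then report a relative robustness score (IoU of the prediction under perturbation divided by the clean-input IoU). This is a minimal illustration under assumed names, not the actual R\(^2\)-Bench toolbox; gaussian_noise, brightness_shift, compose, relative_robustness, and the placeholder model below are all hypothetical.

```python
# Minimal sketch (not the R^2-Bench toolbox): compose perturbations on a frame
# and score the relative robustness of a referring-segmentation model.
import numpy as np

def gaussian_noise(frame: np.ndarray, sigma: float = 0.05) -> np.ndarray:
    """Additive Gaussian pixel noise; frame values assumed in [0, 1]."""
    return np.clip(frame + np.random.normal(0.0, sigma, frame.shape), 0.0, 1.0)

def brightness_shift(frame: np.ndarray, delta: float = 0.2) -> np.ndarray:
    """Global brightness offset, clipped to the valid range."""
    return np.clip(frame + delta, 0.0, 1.0)

def compose(*perturbations):
    """Chain single perturbations into one composite disturbance."""
    def composite(frame):
        for p in perturbations:
            frame = p(frame)
        return frame
    return composite

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union for binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0

def relative_robustness(model, frame, gt_mask, perturb) -> float:
    """Perturbed-input IoU divided by clean-input IoU (1.0 = fully robust)."""
    clean_iou = iou(model(frame), gt_mask)
    pert_iou = iou(model(perturb(frame)), gt_mask)
    return pert_iou / clean_iou if clean_iou > 0 else 0.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.random((64, 64, 3))              # stand-in for a video frame
    gt_mask = np.zeros((64, 64), dtype=bool)
    gt_mask[16:48, 16:48] = True                 # toy ground-truth object mask
    dummy_model = lambda f: f.mean(axis=-1) > 0.5  # placeholder "RPM"
    composite = compose(gaussian_noise, brightness_shift)
    print("relative robustness:",
          relative_robustness(dummy_model, frame, gt_mask, composite))
```

In this style of evaluation, scores near 1.0 indicate that predictions are largely unaffected by the composite corruption, while scores near 0.0 indicate the model collapses under it.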
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, X. et al. (2025). R\(^2\)-Bench: Benchmarking the Robustness of Referring Perception Models Under Perturbations. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15067. Springer, Cham. https://doi.org/10.1007/978-3-031-72673-6_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72672-9
Online ISBN: 978-3-031-72673-6
eBook Packages: Computer Science; Computer Science (R0)