Abstract
Referring perception, which aims at grounding visual objects with multimodal referring guidance, is essential for bridging the gap between humans, who provide instructions, and the environment where intelligent systems perceive. Despite progress in this field, the robustness of referring perception models (RPMs) against disruptive perturbations is not well explored. This work thoroughly assesses the resilience of RPMs against various perturbations in both general and specific contexts. Recognizing the complex nature of referring perception tasks, we present a comprehensive taxonomy of perturbations, and then develop a versatile toolbox for synthesizing and evaluating the effects of composite disturbances. Employing this toolbox, we construct R\(^2\)-Bench, a benchmark for assessing the Robustness of Referring perception models under noisy conditions across five key tasks. Moreover, we propose R\(^2\)-Agent, an LLM-based agent that simplifies and automates model evaluation via natural language instructions. Our investigation uncovers the vulnerabilities of current RPMs to various perturbations and provides tools for assessing model robustness, potentially promoting the safe and resilient integration of intelligent systems into complex real-world scenarios.
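The abstract's notion of "composite disturbances" can be made concrete with a small sketch: chain elementary perturbations into one composite corruption of a frame, then report a relative robustness score (IoU of the prediction under perturbation divided by the clean-input IoU). This is a minimal illustration under assumed names, not the actual R\(^2\)-Bench toolbox; gaussian_noise, brightness_shift, compose, relative_robustness, and the placeholder model below are all hypothetical.

```python
# Minimal sketch (not the R^2-Bench toolbox): compose perturbations on a frame
# and score the relative robustness of a referring-segmentation model.
import numpy as np

def gaussian_noise(frame: np.ndarray, sigma: float = 0.05) -> np.ndarray:
    """Additive Gaussian pixel noise; frame values assumed in [0, 1]."""
    return np.clip(frame + np.random.normal(0.0, sigma, frame.shape), 0.0, 1.0)

def brightness_shift(frame: np.ndarray, delta: float = 0.2) -> np.ndarray:
    """Global brightness offset, clipped to the valid range."""
    return np.clip(frame + delta, 0.0, 1.0)

def compose(*perturbations):
    """Chain single perturbations into one composite disturbance."""
    def composite(frame):
        for p in perturbations:
            frame = p(frame)
        return frame
    return composite

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union for binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0

def relative_robustness(model, frame, gt_mask, perturb) -> float:
    """Perturbed-input IoU divided by clean-input IoU (1.0 = fully robust)."""
    clean_iou = iou(model(frame), gt_mask)
    pert_iou = iou(model(perturb(frame)), gt_mask)
    return pert_iou / clean_iou if clean_iou > 0 else 0.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.random((64, 64, 3))              # stand-in for a video frame
    gt_mask = np.zeros((64, 64), dtype=bool)
    gt_mask[16:48, 16:48] = True                 # toy ground-truth object mask
    dummy_model = lambda f: f.mean(axis=-1) > 0.5  # placeholder "RPM"
    composite = compose(gaussian_noise, brightness_shift)
    print("relative robustness:",
          relative_robustness(dummy_model, frame, gt_mask, composite))
```

In this style of evaluation, scores near 1.0 indicate that predictions are largely unaffected by the composite corruption, while scores near 0.0 indicate the model collapses under it.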
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, X. et al. (2025). R\(^2\)-Bench: Benchmarking the Robustness of Referring Perception Models Under Perturbations. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15067. Springer, Cham. https://doi.org/10.1007/978-3-031-72673-6_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72672-9
Online ISBN: 978-3-031-72673-6
eBook Packages: Computer Science; Computer Science (R0)