Distractors-Immune Representation Learning with Cross-Modal Contrastive Regularization for Change Captioning

Tu, Yunbin; Li, Liang; Su, Li; Yan, Chenggang; Huang, Qingming

doi:10.1007/978-3-031-72775-7_18

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15101)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Change captioning aims to succinctly describe the semantic change between a pair of similar images while remaining immune to distractors (illumination and viewpoint changes). Under these distractors, unchanged objects often exhibit pseudo changes in location and scale, and some objects may overlap others, so the features of the two images become perturbed and less discriminative. However, most existing methods capture the difference between the two images directly, which risks yielding error-prone difference features. In this paper, we propose a distractors-immune representation learning network that correlates the corresponding channels of the two image representations and decorrelates the non-corresponding ones in a self-supervised manner, thus attaining a pair of image representations that are stable under distractors. The model can then better model the interaction between them to capture reliable difference features for caption generation. To yield words based on the most related difference features, we further design a cross-modal contrastive regularization, which regularizes the cross-modal alignment by maximizing the contrastive alignment between the attended difference features and the generated words. Extensive experiments show that our method outperforms state-of-the-art methods on four public datasets. The code is available at https://github.com/tuyunbin/DIRL.
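
The two objectives named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that). It assumes a redundancy-reduction style loss over the channel cross-correlation matrix for the "correlate corresponding channels, decorrelate different ones" step, and an InfoNCE-style loss for the cross-modal contrastive regularization. The function names, the `off_diag_weight` and `temperature` parameters, and the list-of-lists feature layout are all hypothetical and chosen only for illustration.

```python
import math


def standardize(columns):
    """Normalize each channel (one list per channel) to zero mean, unit variance."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / len(col)
        std = math.sqrt(var) + 1e-8  # small constant avoids division by zero
        out.append([(v - mean) / std for v in col])
    return out


def channel_correlation_loss(feat_a, feat_b, off_diag_weight=0.005):
    """Hypothetical sketch of a channel correlation/decorrelation objective.

    feat_a, feat_b: lists of N feature vectors (C channels each), one list per
    image of the pair. Diagonal entries of the C x C cross-correlation matrix
    are pushed toward 1 (corresponding channels agree); off-diagonal entries
    are pushed toward 0 (different channels are decorrelated).
    """
    n = len(feat_a)
    # Transpose from N x C to C x N so that each list holds one channel.
    cols_a = standardize([list(c) for c in zip(*feat_a)])
    cols_b = standardize([list(c) for c in zip(*feat_b)])
    on_diag, off_diag = 0.0, 0.0
    for i, col_a in enumerate(cols_a):
        for j, col_b in enumerate(cols_b):
            corr = sum(x * y for x, y in zip(col_a, col_b)) / n
            if i == j:
                on_diag += (1.0 - corr) ** 2
            else:
                off_diag += corr ** 2
    return on_diag + off_diag_weight * off_diag


def info_nce(queries, keys, temperature=0.1):
    """Hypothetical InfoNCE-style loss: queries[i] should match keys[i].

    Here a query could stand for an attended difference feature and a key for
    the matching word feature; all other keys act as negatives.
    """
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v)) + 1e-8
        return [x / norm for x in v]

    q = [unit(v) for v in queries]
    k = [unit(v) for v in keys]
    loss = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(q)
```

Under these assumptions, a pair of representations whose channels line up gives a correlation loss near zero, while shuffling the channels of one representation raises it. In the same way, `info_nce` is small when each difference feature is closest to its own word feature and large when the pairing is wrong.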

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (62322211, 61931008, 62236008, 62336008, U21B2038, 62225207), the Fundamental Research Funds for the Central Universities (E2ET1104), and the “Pioneer” and “Leading Goose” R&D Program of Zhejiang Province (2024C01023).

Author information

Authors and Affiliations

University of Chinese Academy of Sciences, Beijing, China
Yunbin Tu, Li Su & Qingming Huang
Key Laboratory of AI Safety of CAS, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, China
Liang Li
Lishui Institute of HDU, Hangzhou Dianzi University (HDU), Hangzhou, China
Chenggang Yan

Corresponding authors

Correspondence to Liang Li or Li Su.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Hessen, Germany
Stefan Roth
Princeton University, Palo Alto, CA, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

About this paper

Cite this paper

Tu, Y., Li, L., Su, L., Yan, C., Huang, Q. (2025). Distractors-Immune Representation Learning with Cross-Modal Contrastive Regularization for Change Captioning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15101. Springer, Cham. https://doi.org/10.1007/978-3-031-72775-7_18

DOI: https://doi.org/10.1007/978-3-031-72775-7_18
Published: 30 September 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72774-0
Online ISBN: 978-3-031-72775-7
eBook Packages: Computer Science, Computer Science (R0)
