Abstract
Change captioning aims to succinctly describe the semantic change between a pair of similar images while remaining immune to distractors (illumination and viewpoint changes). Under such distractors, unchanged objects often exhibit pseudo changes in location and scale, and certain objects may overlap others, resulting in perturbed features with degraded discriminability between the two images. However, most existing methods directly compute the difference between the two images, which risks yielding error-prone difference features. In this paper, we propose a distractors-immune representation learning network that correlates the corresponding channels of the two image representations and decorrelates the different ones in a self-supervised manner, thus attaining a pair of image representations that are stable under distractors. The model can then better relate the two representations to capture reliable difference features for caption generation. To generate words based on the most relevant difference features, we further design a cross-modal contrastive regularization, which regularizes cross-modal alignment by maximizing the contrastive alignment between the attended difference features and the generated words. Extensive experiments show that our method outperforms state-of-the-art methods on four public datasets. The code is available at https://github.com/tuyunbin/DIRL.
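The exact objectives are defined in the full paper; as a minimal sketch of the two ideas in the abstract, the snippet below combines a Barlow Twins-style channel-correlation loss (align corresponding channels of the two image representations, decorrelate the rest) with an InfoNCE-style cross-modal contrastive term between difference features and word embeddings. The function names, the pooled (batch, dim) feature interface, and the hyperparameters `lam` and `tau` are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def channel_correlation_loss(z_before, z_after, lam=5e-3):
    """Barlow Twins-style redundancy reduction (hypothetical sketch):
    pull the cross-correlation of corresponding channels toward 1 and
    that of different channels toward 0, so the two image
    representations stay stable under distractors.
    z_before, z_after: (batch, dim) pooled features of the two images.
    """
    # Standardize each channel across the batch.
    z1 = (z_before - z_before.mean(0)) / (z_before.std(0) + 1e-6)
    z2 = (z_after - z_after.mean(0)) / (z_after.std(0) + 1e-6)

    n = z1.size(0)
    c = (z1.T @ z2) / n                                # (dim, dim) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()     # correlate matching channels
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate the rest
    return on_diag + lam * off_diag

def cross_modal_contrastive_loss(diff_feats, word_feats, tau=0.07):
    """InfoNCE-style regularization (hypothetical sketch): maximize the
    alignment between attended difference features and generated-word
    embeddings, treating matched pairs in the batch as positives.
    """
    v = F.normalize(diff_feats, dim=-1)                # (batch, dim)
    w = F.normalize(word_feats, dim=-1)                # (batch, dim)
    logits = v @ w.T / tau                             # pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy over both matching directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

In a full training pipeline, both terms would presumably be added to the standard captioning cross-entropy loss with weighting hyperparameters.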
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China: 62322211, 61931008, 62236008, 62336008, U21B2038, 62225207; the Fundamental Research Funds for the Central Universities (E2ET1104); and the "Pioneer" and "Leading Goose" R&D Program of Zhejiang Province (2024C01023).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tu, Y., Li, L., Su, L., Yan, C., Huang, Q. (2025). Distractors-Immune Representation Learning with Cross-Modal Contrastive Regularization for Change Captioning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15101. Springer, Cham. https://doi.org/10.1007/978-3-031-72775-7_18
DOI: https://doi.org/10.1007/978-3-031-72775-7_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72774-0
Online ISBN: 978-3-031-72775-7
eBook Packages: Computer Science, Computer Science (R0)