research-article

Learning with Adaptive Knowledge for Continual Image-Text Modeling

Authors:
Yutian Luo

Renmin University of China, China

Renmin University of China, China

0000-0001-8707-0180
View Profile

,
Yizhao Gao

Renmin University of China, China

Renmin University of China, China

0000-0002-6903-2962
View Profile

,
Zhiwu Lu

Renmin University of China, China

Renmin University of China, China

0000-0003-0280-7724
View Profile

ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia RetrievalJune 2023Pages 472–480https://doi.org/10.1145/3591106.3592297

Published:12 June 2023Publication History

ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval

Pages 472–480

ABSTRACT

In realistic application scenarios, existing methods for image-text modeling have limitations in dealing with data stream: training on all data needs too much computation/storage resources, and even the full access to previous data is invalid. In this work, we thus propose a new continual image-text modeling (CITM) setting that requires a model to be trained sequentially on a number of diverse image-text datasets. Although recent continual learning methods can be directly applied to the CITM setting, most of them only consider reusing part of previous data or aligning the output distributions of previous and new models, which is a partial or indirect way to acquire the old knowledge. In contrast, we propose a novel dynamic historical adaptation (DHA) method which can holistically and directly review the old knowledge from a historical model. Concretely, the historical model transfers its total parameters to the main/current model to utilize the holistic old knowledge. In turn, the main model dynamically transfers its parameters to the historical model at every five training steps to ensure that the knowledge gap between them is not too large. Extensive experiments show that our proposed DHA outperforms other representative/latest continual learning methods under the CITM setting.

Supplemental Material

Available for Download

pdf

Appendix (536.1 KB)

pdf

Appendix (536.1 KB)

References

Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. 2018. Memory aware synapses: Learning what (not) to forget. In European Conference on Computer Vision. 139–154.Google ScholarDigital Library
Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. 2017. Expert gate: Lifelong learning with a network of experts. In IEEE Conference on Computer Vision and Pattern Recognition. 3366–3375.Google ScholarCross Ref
Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. 2019. Gradient based sample selection for online continual learning. Advances in Neural Information Processing Systems 32 (2019), 11817–11826.Google Scholar
Craig Atkinson, Brendan McCane, Lech Szymanski, and Anthony Robins. 2018. Pseudo-recursal: Solving the catastrophic forgetting problem in deep neural networks. arXiv preprint arXiv:1802.03875 (2018).Google Scholar
Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE/CVF International Conference on Computer Vision. 1728–1738.Google ScholarCross Ref
Ali Furkan Biten, Lluis Gomez, Marçal Rusinol, and Dimosthenis Karatzas. 2019. Good news, everyone! context driven entity-aware captioning for news images. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12466–12475.Google ScholarCross Ref
Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. 2020. Dark experience for general continual learning: a strong, simple baseline. Advances in Neural Information Processing Systems 33 (2020), 15920–15930.Google Scholar
Hyuntak Cha, Jaeho Lee, and Jinwoo Shin. 2021. Co2l: Contrastive continual learning. In IEEE International Conference on Computer Vision. 9516–9525.Google ScholarCross Ref
Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. 2018. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In European Conference on Computer Vision. 532–547.Google ScholarDigital Library
Arslan Chaudhry, Albert Gordo, Puneet Kumar Dokania, Philip H. S. Torr, and David Lopez-Paz. 2021. Using Hindsight to Anchor Past Knowledge in Continual Learning. In AAAI Conference on Artificial Intelligence. 6993–7001.Google ScholarCross Ref
Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc’Aurelio Ranzato. 2019. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486 (2019).Google Scholar
Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, and Jungong Han. 2020. IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12652–12660.Google Scholar
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning. 1597–1607.Google Scholar
Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. 2021. A continual learning survey: Defying forgetting in classification tasks. TPAMI 44, 7 (2021), 3366–3385.Google Scholar
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).Google Scholar
Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, and Thomas Brox. 2020. Coot: Cooperative hierarchical transformer for video-text representation learning. Advances in Neural Information Processing Systems 33 (2020), 22605–22618.Google Scholar
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition. 770–778.Google ScholarCross Ref
Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, 2021. WenLan: Bridging vision and language by large-scale multi-modal pre-training. arXiv preprint arXiv:2103.06561 (2021).Google Scholar
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning. 4904–4916.Google Scholar
Xu Jia, Efstratios Gavves, Basura Fernando, and Tinne Tuytelaars. 2015. Guiding the Long-Short Term Memory Model for Image Caption Generation. In IEEE International Conference on Computer Vision. 2407–2415.Google Scholar
Justin Johnson, Agrim Gupta, and Li Fei-Fei. 2018. Image Generation from Scene Graphs. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1219–1228.Google ScholarCross Ref
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, 13 (2017), 3521–3526.Google ScholarCross Ref
Frantzeska Lavda, Jason Ramapuram, Magda Gregorova, and Alexandros Kalousis. 2018. Continual classification learning using generative models. arXiv preprint arXiv:1810.10612 (2018).Google Scholar
Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked Cross Attention for Image-Text Matching. ArXiv abs/1803.08024 (2018).Google Scholar
Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L Berg, Mohit Bansal, and Jingjing Liu. 2021. Less is more: Clipbert for video-and-language learning via sparse sampling. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7331–7341.Google ScholarCross Ref
Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. 2020. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In AAAI Conference on Artificial Intelligence. 11336–11344.Google ScholarCross Ref
Zhizhong Li and Derek Hoiem. 2017. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 12 (2017), 2935–2947.Google ScholarDigital Library
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European Conference on Computer Vision. 740–755.Google ScholarCross Ref
Yaoyao Liu, Yuting Su, An-An Liu, Bernt Schiele, and Qianru Sun. 2020. Mnemonics training: Multi-class incremental learning without forgetting. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12245–12254.Google ScholarCross Ref
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems 32 (2019), 13–23.Google Scholar
Arun Mallya, Dillon Davis, and Svetlana Lazebnik. 2018. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In European Conference on Computer Vision. 67–82.Google ScholarDigital Library
Arun Mallya and Svetlana Lazebnik. 2018. Packnet: Adding multiple tasks to a single network by iterative pruning. In IEEE Conference on Computer Vision and Pattern Recognition. 7765–7773.Google ScholarCross Ref
Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. 2019. MirrorGAN: Learning Text-To-Image Generation by Redescription. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1505–1514.Google ScholarCross Ref
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. 8748–8763.Google Scholar
Jason Ramapuram, Magda Gregorová, and Alexandros Kalousis. 2020. Lifelong Generative Modeling. ArXiv abs/1705.09847 (2020).Google Scholar
Amal Rannen, Rahaf Aljundi, Matthew B Blaschko, and Tinne Tuytelaars. 2017. Encoder based lifelong learning. In IEEE International Conference on Computer Vision. 1320–1328.Google ScholarCross Ref
Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. 2017. icarl: Incremental classifier and representation learning. In IEEE Conference on Computer Vision and Pattern Recognition. 2001–2010.Google ScholarCross Ref
Amir Rosenfeld and John K Tsotsos. 2018. Incremental learning through deep adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 3 (2018), 651–663.Google ScholarCross Ref
Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. 2018. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning. 4548–4557.Google Scholar
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Annual Meeting of the Association for Computational Linguistics. 2556–2565.Google ScholarCross Ref
Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. 2017. Continual learning with deep generative replay. Advances in Neural Information Processing Systems 30 (2017), 2994–3003.Google Scholar
Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. 2021. Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. In ACM SIGIR Conference on Research and Development in Information Retrieval. 2443–2449.Google ScholarDigital Library
Hao Tan and Mohit Bansal. 2019. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490 (2019).Google Scholar
Oriol Vinyals, Alexander Toshev, Samy Bengio, and D. Erhan. 2015. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition. 3156–3164.Google Scholar
Ju Xu and Zhanxing Zhu. 2018. Reinforced continual learning. Advances in Neural Information Processing Systems 31 (2018), 907–916.Google Scholar
Peter Young, Alice Lai, Micah Hodosh, and J. Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2 (2014), 67–78.Google ScholarCross Ref
Junting Zhang, Jie Zhang, Shalini Ghosh, Dawei Li, Serafettin Tasci, Larry Heck, Heming Zhang, and C-C Jay Kuo. 2020. Class-incremental learning via deep model consolidation. In IEEE/CVF Winter Conference on Applications of Computer Vision. 1131–1140.Google ScholarCross Ref
Linchao Zhu and Yi Yang. 2020. Actbert: Learning global-local video-text representations. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8746–8755.Google ScholarCross Ref

Index Terms

Learning with Adaptive Knowledge for Continual Image-Text Modeling
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Visual content-based indexing and retrieval

Recommendations

Knowledge Decomposition and Replay: A Novel Cross-modal Image-Text Retrieval Continual Learning Method
MM '23: Proceedings of the 31st ACM International Conference on Multimedia

To enable machines to mimic human cognitive abilities and alleviate the catastrophic forgetting problem in cross-modal image-text retrieval (CMITR), this paper proposes a novel continual learning method, Knowledge Decomposition and Replay (KDR), which ...
Read More
Adaptive online continual multi-view learning
Abstract
Deep neural networks (DNNs) have gained great success in information fusion. However, recent studies report that DNNs are suffering from catastrophic forgetting, i.e., DNNs would forget the knowledge learned from previous tasks when training on ...
Highlights
- The paper proposes a new setting for continual multi-view learning.
- A novel multi-view learning method is formulated to handle the distribution shift issue.
- Comprehensive experiments are conducted to verify the method.
Read More
Self-Attention Meta-Learner for Continual Learning
AAMAS '21: Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems

Continual learning aims to provide intelligent agents capable of learning multiple tasks sequentially with neural networks. In most settings of the current approaches, the agent starts from randomly initialized parameters and is optimized to master the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval
June 2023
694 pages
ISBN:9798400701788
DOI:10.1145/3591106
Editors:
Ioannis (Yiannis) Kompatsiaris
Centre for Research and Technology Hellas, Greece
,
Jiebo Luo
University of Rochester,USA
,
Nicu Sebe
University of Trento, Italy
,
Angela Yao
National University of Singapore, Singapore
,
Vasileios Mezaris
Centre for Research and Technology Hellas, Greece
,
Symeon Papadopoulos
Centre for Research and Technology Hellas, Greece
,
Adrian Popescu
CEA LIST, France
,
Zi (Helen) Huang
University of Queensland, Australia
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 12 June 2023
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Image-text modeling
continual learning
contrastive learning
cross-modal retrieval
Qualifiers
- research-article
- Research
- Refereed limited
Conference

Acceptance Rates
Overall Acceptance Rate254of830submissions,31%
Upcoming Conference
ICMR '24

Sponsor:

sigmm

International Conference on Multimedia Retrieval

June 10 - 14, 2024

Phuket , Thailand
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 0
  Total Citations
  View Citations
- 121
  Total Downloads
- Downloads (Last 12 months)121
- Downloads (Last 6 weeks)9
Other Metrics
View Author Metrics
Cited By
This publication has not been cited yet

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format .

View HTML Format

Learning with Adaptive Knowledge for Continual Image-Text Modeling

ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval

ABSTRACT

Supplemental Material

Available for Download

References

Cited By

Index Terms

Recommendations

Knowledge Decomposition and Replay: A Novel Cross-modal Image-Text Retrieval Continual Learning Method

Adaptive online continual multi-view learning

Self-Attention Meta-Learner for Continual Learning

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

HTML Format

Share this Publication link

Share on Social Media