Research article
DOI: 10.1145/3581783.3612104

M3R: Masked Token Mixup and Cross-Modal Reconstruction for Zero-Shot Learning

Published: 27 October 2023

Abstract

In zero-shot learning (ZSL), learned representation spaces are often biased toward seen classes, which limits the ability to recognize previously unseen classes. In this paper, we propose Masked token Mixup and cross-Modal Reconstruction for zero-shot learning, termed M3R, which significantly alleviates this bias. M3R consists of three components: Random Token Mixup (RTM), Unseen Class Detection (UCD), and Hard Cross-modal Reconstruction (HCR). First, mappings learned without proper adaptation to unseen classes are biased toward seen classes; to address this, RTM generates diverse unseen-class agents that broaden the representation space to cover unknown classes. RTM is applied at a randomly selected layer of the Vision Transformer, yielding smooth boundaries in both low- and high-level representation spaces that cover rich attributes. Second, the unseen-class agents generated by RTM may be confused with seen-class samples; UCD therefore assigns higher entropy to unseen classes, distinguishing them from seen classes. Third, to further mitigate the bias toward seen classes and to model the associations between semantics and visual content, HCR reconstructs masked pixels from a few discriminative tokens and attribute embeddings. This forces the model to develop a deep understanding of image content and to build strong connections between semantic attributes and visual information. Both qualitative and quantitative results demonstrate the effectiveness of the proposed M3R model.
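
The following is a minimal PyTorch sketch of the Random Token Mixup idea described above: two token streams are interpolated at a randomly chosen Vision Transformer layer to synthesize unseen-class agents. The function name, the Beta(alpha, alpha) mixing prior, and the pairing scheme are illustrative assumptions, not the authors' released code.

    import random
    from torch.distributions import Beta

    def rtm_forward(blocks, tokens_a, tokens_b, alpha=1.0):
        """Run ViT blocks, mixing two token streams at one random depth.

        blocks   : sequence of transformer blocks (one shared network)
        tokens_a : (B, N, D) patch tokens of one batch of images
        tokens_b : (B, N, D) patch tokens of the images they are mixed with
        Returns the mixed representation and the mixing coefficient lam.
        """
        mix_layer = random.randrange(len(blocks))   # randomly selected layer
        lam = Beta(alpha, alpha).sample().item()    # mixing coefficient in (0, 1)

        x, y = tokens_a, tokens_b
        for i, blk in enumerate(blocks):
            if i == mix_layer:
                x = lam * x + (1.0 - lam) * y       # synthesize unseen-class agent
            x = blk(x)
            if i < mix_layer:
                y = blk(y)                          # second stream runs only until mixed
        return x, lam

Mixing at a random depth, rather than only at the input as in standard mixup, is what produces the smooth boundaries in both low- and high-level representation spaces mentioned in the abstract.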
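Unseen Class Detection can likewise be sketched as an entropy criterion over seen-class logits: mixed agents are trained toward high-entropy predictions, and at test time a sample whose prediction entropy exceeds a threshold is routed to the unseen-class branch. The loss form and the threshold rule below are assumptions for illustration only.

    import torch.nn.functional as F

    def seen_class_entropy(logits):
        """Shannon entropy of the softmax over seen-class logits; shape (B,)."""
        p = F.softmax(logits, dim=-1)
        return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

    def ucd_loss(seen_logits, agent_logits):
        """Push real seen samples toward low entropy, mixed agents toward high."""
        return seen_class_entropy(seen_logits).mean() \
             - seen_class_entropy(agent_logits).mean()

    def is_unseen(logits, threshold):
        """Test time: entropy above the threshold flags a likely unseen class."""
        return seen_class_entropy(logits) > threshold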
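Hard Cross-modal Reconstruction resembles a masked-autoencoder decoder that must recover masked patch pixels from a few kept discriminative tokens together with attribute embeddings. The module below is a hedged sketch of that interface; its shapes, depth, and token-selection rule are assumptions rather than the paper's exact architecture.

    import torch
    import torch.nn as nn

    class HCRDecoder(nn.Module):
        """Reconstruct masked patch pixels from few kept tokens plus attributes."""

        def __init__(self, dim=768, patch_pixels=16 * 16 * 3, depth=2, heads=8):
            super().__init__()
            layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.decoder = nn.TransformerEncoder(layer, depth)
            self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.to_pixels = nn.Linear(dim, patch_pixels)

        def forward(self, kept_tokens, attr_embed, num_masked):
            """kept_tokens: (B, K, D) discriminative visual tokens;
            attr_embed: (B, A, D) projected attribute embeddings."""
            b = kept_tokens.size(0)
            masks = self.mask_token.expand(b, num_masked, -1)
            z = torch.cat([attr_embed, kept_tokens, masks], dim=1)
            z = self.decoder(z)
            return self.to_pixels(z[:, -num_masked:])  # pixels of masked patches

Training would then minimize a pixel-level loss, such as mean squared error between the predicted and the true masked patches, so that reconstruction succeeds only if the attribute embeddings genuinely inform the visual content.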


Cited By

  • (2024) Visual-Semantic Decomposition and Partial Alignment for Document-based Zero-Shot Learning. Proceedings of the 32nd ACM International Conference on Multimedia, 4581-4590. https://doi.org/10.1145/3664647.3680829
  • (2024) KNN Transformer with Pyramid Prompts for Few-Shot Learning. Proceedings of the 32nd ACM International Conference on Multimedia, 1082-1091. https://doi.org/10.1145/3664647.3680601
  • (2024) Hypergraph-guided Intra- and Inter-category Relation Modeling for Fine-grained Visual Recognition. Proceedings of the 32nd ACM International Conference on Multimedia, 8043-8052. https://doi.org/10.1145/3664647.3680589
  • (2024) Characterizing Hierarchical Semantic-Aware Parts With Transformers for Generalized Zero-Shot Learning. IEEE Transactions on Circuits and Systems for Video Technology, 34(11), 11493-11506. https://doi.org/10.1109/TCSVT.2024.3422491


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023, 9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. masked image modeling
    2. mixup
    3. transformer
    4. zero-shot learning

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

    Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)

