Research article
DOI: 10.1145/3581783.3612104

M3R: Masked Token Mixup and Cross-Modal Reconstruction for Zero-Shot Learning

Published: 27 October 2023

Abstract

In zero-shot learning (ZSL), learned representation spaces are often biased toward seen classes, which limits the ability to recognize previously unseen classes. In this paper, we propose Masked token Mixup and cross-Modal Reconstruction for zero-shot learning, termed M3R, which significantly alleviates this bias. M3R consists of three components: Random Token Mixup (RTM), Unseen Class Detection (UCD), and Hard Cross-modal Reconstruction (HCR). First, mappings learned without proper adaptation to unseen classes are biased toward seen classes; to address this, RTM generates diverse unseen-class agents that broaden the representation space to cover unknown classes. RTM is applied at a randomly selected layer of the Vision Transformer, yielding smooth boundaries in both low- and high-level representation spaces that cover rich attributes. Second, the unseen-class agents generated by RTM may be confused with seen-class samples; UCD therefore assigns higher entropy to unseen classes, distinguishing them from seen classes. Third, to further mitigate the bias toward seen classes and to model the associations between semantics and visual content, HCR reconstructs masked pixels from a few discriminative tokens and attribute embeddings. This forces the model to develop a deep understanding of image content and to build strong connections between semantic attributes and visual information. Both qualitative and quantitative results demonstrate the effectiveness of the proposed M3R model.
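
The following is a minimal PyTorch sketch of the Random Token Mixup idea described above: two token streams are interpolated at a randomly chosen Vision Transformer layer to synthesize unseen-class agents. The function name, the Beta(alpha, alpha) mixing prior, and the pairing scheme are illustrative assumptions, not the authors' released code.

    import random
    from torch.distributions import Beta

    def rtm_forward(blocks, tokens_a, tokens_b, alpha=1.0):
        """Run ViT blocks, mixing two token streams at one random depth.

        blocks   : sequence of transformer blocks (one shared network)
        tokens_a : (B, N, D) patch tokens of one batch of images
        tokens_b : (B, N, D) patch tokens of the images they are mixed with
        Returns the mixed representation and the mixing coefficient lam.
        """
        mix_layer = random.randrange(len(blocks))   # randomly selected layer
        lam = Beta(alpha, alpha).sample().item()    # mixing coefficient in (0, 1)

        x, y = tokens_a, tokens_b
        for i, blk in enumerate(blocks):
            if i == mix_layer:
                x = lam * x + (1.0 - lam) * y       # synthesize unseen-class agent
            x = blk(x)
            if i < mix_layer:
                y = blk(y)                          # second stream runs only until mixed
        return x, lam

Mixing at a random depth, rather than only at the input as in standard mixup, is what produces the smooth boundaries in both low- and high-level representation spaces mentioned in the abstract.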
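Unseen Class Detection can likewise be sketched as an entropy criterion over seen-class logits: mixed agents are trained toward high-entropy predictions, and at test time a sample whose prediction entropy exceeds a threshold is routed to the unseen-class branch. The loss form and the threshold rule below are assumptions for illustration only.

    import torch.nn.functional as F

    def seen_class_entropy(logits):
        """Shannon entropy of the softmax over seen-class logits; shape (B,)."""
        p = F.softmax(logits, dim=-1)
        return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

    def ucd_loss(seen_logits, agent_logits):
        """Push real seen samples toward low entropy, mixed agents toward high."""
        return seen_class_entropy(seen_logits).mean() \
             - seen_class_entropy(agent_logits).mean()

    def is_unseen(logits, threshold):
        """Test time: entropy above the threshold flags a likely unseen class."""
        return seen_class_entropy(logits) > threshold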
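Hard Cross-modal Reconstruction resembles a masked-autoencoder decoder that must recover masked patch pixels from a few kept discriminative tokens together with attribute embeddings. The module below is a hedged sketch of that interface; its shapes, depth, and token-selection rule are assumptions rather than the paper's exact architecture.

    import torch
    import torch.nn as nn

    class HCRDecoder(nn.Module):
        """Reconstruct masked patch pixels from few kept tokens plus attributes."""

        def __init__(self, dim=768, patch_pixels=16 * 16 * 3, depth=2, heads=8):
            super().__init__()
            layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.decoder = nn.TransformerEncoder(layer, depth)
            self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.to_pixels = nn.Linear(dim, patch_pixels)

        def forward(self, kept_tokens, attr_embed, num_masked):
            """kept_tokens: (B, K, D) discriminative visual tokens;
            attr_embed: (B, A, D) projected attribute embeddings."""
            b = kept_tokens.size(0)
            masks = self.mask_token.expand(b, num_masked, -1)
            z = torch.cat([attr_embed, kept_tokens, masks], dim=1)
            z = self.decoder(z)
            return self.to_pixels(z[:, -num_masked:])  # pixels of masked patches

Training would then minimize a pixel-level loss, such as mean squared error between the predicted and the true masked patches, so that reconstruction succeeds only if the attribute embeddings genuinely inform the visual content.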


Cited By

  • (2024) Visual-Semantic Decomposition and Partial Alignment for Document-based Zero-Shot Learning. Proceedings of the 32nd ACM International Conference on Multimedia, 4581-4590. https://doi.org/10.1145/3664647.3680829
  • (2024) KNN Transformer with Pyramid Prompts for Few-Shot Learning. Proceedings of the 32nd ACM International Conference on Multimedia, 1082-1091. https://doi.org/10.1145/3664647.3680601
  • (2024) Hypergraph-guided Intra- and Inter-category Relation Modeling for Fine-grained Visual Recognition. Proceedings of the 32nd ACM International Conference on Multimedia, 8043-8052. https://doi.org/10.1145/3664647.3680589
  • (2024) Characterizing Hierarchical Semantic-Aware Parts With Transformers for Generalized Zero-Shot Learning. IEEE Transactions on Circuits and Systems for Video Technology, 34(11), 11493-11506. https://doi.org/10.1109/TCSVT.2024.3422491


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023, 9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. masked image modeling
    2. mixup
    3. transformer
    4. zero-shot learning

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

    Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)

