ABSTRACT
The goal of Zero-shot Learning (ZSL) is to recognize categories that are not seen during training. The traditional approach is to learn an embedding space and map visual and semantic features into this common space. However, this approach inevitably suffers from the bias problem, i.e., unseen instances are often incorrectly recognized as seen classes. An alternative paradigm instead uses generative models to hallucinate the features of unseen samples. However, generative models often suffer from instability, making it impractical for them to generate fine-grained features of unseen samples and thus yielding very limited improvement. To address this, we propose a Semantic Enhanced Cross-modal GAN (SECM GAN), which imposes a cross-modal association to improve the semantic and discriminative properties of the generated features. Specifically, we first train a cross-modal embedding model, the Semantic Enhanced Cross-modal Model (SECM), which is constrained by discrimination and semantics. We then train our generative model, SECM GAN, based on a Generative Adversarial Network (GAN): the generator produces cross-modal features, and the discriminator distinguishes real cross-modal features from generated ones. We deploy SECM as a weak constraint on the GAN, which reduces the reliance on the GAN. Extensive experiments on three widely used ZSL datasets demonstrate the superiority of our framework.
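The generative ZSL paradigm contrasted above can be sketched as follows. This is a toy illustration, not the paper's method: the fixed random projections stand in for a trained generator, the class names and dimensions are invented, and the classifier is a simple nearest-class-mean rule over hallucinated features.

```python
import numpy as np

rng = np.random.default_rng(0)
ATTR_DIM, FEAT_DIM, NOISE_DIM = 8, 16, 4

# Per-class semantic attribute vectors; imagine "horse" was unseen in training.
attrs = {"zebra": rng.normal(size=ATTR_DIM),
         "horse": rng.normal(size=ATTR_DIM)}

# Fixed random projections stand in for the weights of a trained generator.
W_attr = rng.normal(size=(ATTR_DIM, FEAT_DIM))
W_noise = rng.normal(size=(NOISE_DIM, FEAT_DIM))

def generate_features(attr, n=50):
    """Stand-in for a trained generator G(z, attr): a semantic attribute
    vector plus Gaussian noise is mapped into the visual feature space."""
    z = rng.normal(size=(n, NOISE_DIM))
    return attr @ W_attr + 0.1 * (z @ W_noise)

# Hallucinate features for every class (including unseen ones), turning
# ZSL into ordinary supervised classification; here the "classifier" is
# simply the nearest class mean in feature space.
class_means = {c: generate_features(a).mean(axis=0) for c, a in attrs.items()}

def classify(x):
    return min(class_means, key=lambda c: np.linalg.norm(x - class_means[c]))
```

Once unseen-class features are synthesized, any off-the-shelf classifier can be trained on them; the paper's contribution lies in making the generated features semantic and discriminative, which this sketch does not model.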
Index Terms: Semantic Enhanced Cross-modal GAN for Zero-shot Learning