ABSTRACT
Recently, feature generating methods have been successfully applied to zero-shot learning (ZSL). However, most previous approaches generate only visual representations for zero-shot recognition. In fact, typical ZSL is a classic multi-modal learning protocol that consists of a visual space and a semantic space. In this paper, we therefore present a new method that simultaneously generates both visual and semantic representations, so that the essential multi-modal information associated with unseen classes can be captured. Specifically, we address the most challenging issue in such a paradigm, i.e., how to handle the domain shift and thus guarantee that the learned representations are modality-invariant. To this end, we propose two strategies: 1) leveraging the mutual information between the latent visual representations and the semantic representations; and 2) maximizing the entropy of the joint distribution of the two latent representations. By combining the two strategies, we argue that the two modalities can be well aligned. Finally, extensive experiments on five widely used datasets verify that the proposed method significantly outperforms previous state-of-the-art methods.
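To make the first strategy concrete, the sketch below illustrates why mutual information is a sensible alignment signal between the two latent spaces: paired visual and semantic codes that encode the same underlying class information share high mutual information, while unrelated codes do not. This is only a toy histogram-based estimator on synthetic 1-D latents, not the paper's method (which would use a neural estimator in the spirit of MINE); all variable names here are illustrative assumptions.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    joint = joint / joint.sum()
    px = joint.sum(axis=1)  # marginal of X
    py = joint.sum(axis=0)  # marginal of Y
    return entropy(px) + entropy(py) - entropy(joint.flatten())

rng = np.random.default_rng(0)
z = rng.normal(size=2000)                # shared class-level factor
x = z + 0.1 * rng.normal(size=2000)      # toy "visual" latent code
y = z + 0.1 * rng.normal(size=2000)      # toy "semantic" latent code (aligned)
u = rng.normal(size=2000)                # independent control (unaligned)

mi_aligned = mutual_information(x, y)    # high: modalities share information
mi_random = mutual_information(x, u)     # near zero: no shared information
```

A generative ZSL model following this idea would maximize such an MI estimate between its visual and semantic latents during training, pushing the two modalities toward a common, modality-invariant representation.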