DOI:10.1145/3444685.3446283

Graph-based variational auto-encoder for generalized zero-shot learning

Published: 03 May 2021

Abstract

Zero-shot learning has been a prominent research topic in both the vision and language communities. Recently, generative methods have emerged as a new trend in zero-shot learning: they synthesize samples of unseen categories with generative models. However, the lack of fine-grained information in the synthesized samples makes it difficult to improve classification accuracy, and synthesizing samples and then training classifiers on them is time-consuming and inefficient. To address these issues, we propose a novel Graph-based Variational Auto-Encoder for zero-shot learning. Specifically, we adopt a knowledge graph to model explicit inter-class relationships and design a fully graph-convolutional auto-encoder framework that generates classifiers from the distribution of class-level semantic features on individual nodes. The encoder learns a latent representation for each node, and the decoder generates a classifier from each latent representation. In contrast to synthesizing samples, our method directly generates classifiers from the distribution of class-level semantic features for both seen and unseen categories, which is more straightforward, more accurate, and computationally more efficient. We conduct extensive experiments on the widely used large-scale ImageNet-21K dataset, and the results validate the efficacy of the proposed approach.
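
To make the approach concrete, the following is a minimal sketch, assuming a PyTorch implementation, of a graph-based variational auto-encoder that maps class-level semantic vectors on knowledge-graph nodes (e.g., word embeddings) to per-class classifier weights. The module names (GraphVAE, GCNLayer), layer sizes, and toy graph are illustrative assumptions rather than the authors' released code.

# Minimal sketch (illustrative, not the authors' code): a GCN encoder maps
# node semantics to a latent Gaussian per class; a GCN decoder maps sampled
# latents back to per-class classifier weights.
import torch
import torch.nn as nn
import torch.nn.functional as F


def normalize_adj(adj):
    # Symmetric normalization with self-loops: D^-1/2 (A + I) D^-1/2
    a = adj + torch.eye(adj.size(0))
    d_inv_sqrt = a.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)


class GCNLayer(nn.Module):
    # One graph convolution: H' = A_hat (H W)
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, a_hat):
        return a_hat @ self.linear(x)


class GraphVAE(nn.Module):
    # Encoder: node semantics -> (mu, logvar); decoder: latent -> classifier weights.
    def __init__(self, sem_dim, hid_dim, lat_dim, cls_dim):
        super().__init__()
        self.enc = GCNLayer(sem_dim, hid_dim)
        self.enc_mu = GCNLayer(hid_dim, lat_dim)
        self.enc_logvar = GCNLayer(hid_dim, lat_dim)
        self.dec1 = GCNLayer(lat_dim, hid_dim)
        self.dec2 = GCNLayer(hid_dim, cls_dim)

    def forward(self, x, a_hat):
        h = F.leaky_relu(self.enc(x, a_hat))
        mu, logvar = self.enc_mu(h, a_hat), self.enc_logvar(h, a_hat)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        w = self.dec2(F.leaky_relu(self.dec1(z, a_hat)), a_hat)  # one weight vector per node
        return w, mu, logvar


# Toy usage: a 5-node chain graph, 300-d semantic vectors, 2048-d visual features.
adj = torch.tensor([[0., 1., 0., 0., 0.],
                    [1., 0., 1., 0., 0.],
                    [0., 1., 0., 1., 0.],
                    [0., 0., 1., 0., 1.],
                    [0., 0., 0., 1., 0.]])
a_hat = normalize_adj(adj)
semantics = torch.randn(5, 300)  # stand-in for per-class word embeddings
model = GraphVAE(sem_dim=300, hid_dim=512, lat_dim=128, cls_dim=2048)
weights, mu, logvar = model(semantics, a_hat)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # standard VAE KL term

In such a setup, the predicted weights of seen classes would typically be regressed onto the classifier weights of a pre-trained CNN, with the KL term as a regularizer; at test time, the decoder's output rows for unseen nodes can serve directly as their classifiers.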


Cited By

  • (2023) Adaptive Latent Graph Representation Learning for Image-Text Matching. IEEE Transactions on Image Processing 32, 471-482. DOI: 10.1109/TIP.2022.3229631



    Published In

    MMAsia '20: Proceedings of the 2nd ACM International Conference on Multimedia in Asia
    March 2021
    512 pages
    ISBN:9781450383080
    DOI:10.1145/3444685
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 May 2021


    Author Tags

    1. generalized zero-shot learning
    2. graph-based variational autoencoder
    3. large-scale dataset

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Dongguan Songshan Lake Introduction Program of Leading Innovative and Entrepreneurial Talents
    • Fundamental Research Funds for the Central Universities
    • Sichuan Science and Technology Program, China

    Conference

    MMAsia '20
    Sponsor: MMAsia '20: ACM Multimedia Asia
    March 7, 2021
    Virtual Event, Singapore

    Acceptance Rates

    Overall Acceptance Rate 59 of 204 submissions, 29%


    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 18
    • Downloads (Last 6 weeks): 4
    Reflects downloads up to 17 Feb 2025

