
Zero-shot Cross-modal Retrieval by Assembling AutoEncoder and Generative Adversarial Network

Published: 31 March 2021

Abstract

Conventional cross-modal retrieval models generally assume that the training and testing sets share the same set of classes. This assumption limits their extensibility to zero-shot cross-modal retrieval (ZS-CMR), where the testing set consists of unseen classes that are disjoint from the seen classes in the training set. The ZS-CMR task is more challenging due to the heterogeneous distributions of different modalities and the semantic inconsistency between seen and unseen classes. A few recently proposed approaches, inspired by zero-shot learning (ZSL), estimate the distribution underlying multimodal data with generative models and transfer knowledge from seen to unseen classes by leveraging class embeddings. However, directly borrowing ideas from ZSL is not fully suited to the retrieval task, since the core of retrieval is learning the common space. To address these issues, we propose a novel approach named Assembling AutoEncoder and Generative Adversarial Network (AAEGAN), which combines the strengths of the AutoEncoder (AE) and the Generative Adversarial Network (GAN) to jointly perform common latent space learning, knowledge transfer, and feature synthesis for ZS-CMR. In addition, instead of using class embeddings as the common space, AAEGAN maps all multimodal data into a learned latent space with distribution alignment via three coupled AEs. We empirically show remarkable improvements on the ZS-CMR task and establish state-of-the-art or competitive performance on four image-text retrieval datasets.
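To make the abstract's description more concrete, below is a minimal, illustrative sketch of the kind of coupled-autoencoder latent alignment it describes: three AEs (image, text, class embedding) encode into one shared latent space, reconstruction keeps each code faithful to its modality, a coupling term ties paired codes together, and a discriminator adversarially aligns the latent distributions. All module names, feature dimensions, and loss weights here are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only (PyTorch): coupling three autoencoders (image,
# text, class embedding) in one latent space, in the spirit of the abstract.
# Dimensions, names, and loss weights are assumptions.
import torch
import torch.nn as nn

LATENT = 256

def make_ae(in_dim, latent=LATENT):
    """Return a simple encoder/decoder pair for one modality."""
    enc = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(), nn.Linear(1024, latent))
    dec = nn.Sequential(nn.Linear(latent, 1024), nn.ReLU(), nn.Linear(1024, in_dim))
    return enc, dec

img_enc, img_dec = make_ae(4096)   # e.g., CNN image features (assumed dimension)
txt_enc, txt_dec = make_ae(300)    # e.g., doc2vec text features (assumed dimension)
cls_enc, cls_dec = make_ae(300)    # class (word) embeddings for knowledge transfer

# A discriminator on latent codes encourages the modality distributions to
# align, in the GAN spirit; its own update step is omitted for brevity.
disc = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(), nn.Linear(256, 1))

mse = nn.MSELoss()
bce = nn.BCEWithLogitsLoss()

def encoder_loss(img, txt, cls_emb):
    """Loss for one batch of paired (image, text, class-embedding) data."""
    z_img, z_txt, z_cls = img_enc(img), txt_enc(txt), cls_enc(cls_emb)

    # Reconstruction keeps each latent code faithful to its own modality.
    recon = (mse(img_dec(z_img), img)
             + mse(txt_dec(z_txt), txt)
             + mse(cls_dec(z_cls), cls_emb))

    # Coupling: latent codes of paired data should coincide across modalities.
    couple = mse(z_img, z_txt) + mse(z_img, z_cls) + mse(z_txt, z_cls)

    # Adversarial alignment (encoder side): push text codes toward the
    # distribution the discriminator associates with image codes.
    logits = disc(z_txt)
    adv = bce(logits, torch.ones_like(logits))

    return recon + couple + 0.1 * adv   # 0.1 is an arbitrary illustrative weight
```

A full method along these lines would also train the discriminator on image versus text codes and add a class-conditioned generator that synthesizes latent features for unseen classes; the sketch covers only the alignment part.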



Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 17, Issue 1s
January 2021
353 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3453990

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 March 2021
Accepted: 01 September 2020
Revised: 01 August 2020
Received: 01 April 2020
Published in TOMM Volume 17, Issue 1s


Author Tags

  1. Cross-modal retrieval
  2. zero-shot learning
  3. feature synthesis

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • Fundamental Research Funds for the Central Universities
  • Sichuan Science and Technology Program, China

