
Zero-shot Cross-modal Retrieval by Assembling AutoEncoder and Generative Adversarial Network

Published: 31 March 2021

Abstract

Conventional cross-modal retrieval models generally assume that the training and testing sets share the same set of classes. This assumption limits their extensibility to zero-shot cross-modal retrieval (ZS-CMR), where the testing set consists of unseen classes that are disjoint from the seen classes in the training set. The ZS-CMR task is more challenging due to the heterogeneous distributions of different modalities and the semantic inconsistency between seen and unseen classes. A few recently proposed approaches, inspired by zero-shot learning (ZSL), estimate the distribution underlying multimodal data with generative models and transfer knowledge from seen to unseen classes by leveraging class embeddings. However, directly borrowing ideas from ZSL is not fully suited to the retrieval task, since the core of retrieval is learning the common space. To address these issues, we propose a novel approach named Assembling AutoEncoder and Generative Adversarial Network (AAEGAN), which combines the strengths of the AutoEncoder (AE) and the Generative Adversarial Network (GAN) to jointly perform common latent space learning, knowledge transfer, and feature synthesis for ZS-CMR. In addition, instead of using class embeddings as the common space, AAEGAN maps all multimodal data into a learned latent space with distribution alignment via three coupled AEs. We empirically show remarkable improvements on the ZS-CMR task and establish state-of-the-art or competitive performance on four image-text retrieval datasets.
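To make the abstract's description more concrete, below is a minimal, illustrative sketch of the kind of coupled-autoencoder latent alignment it describes: three AEs (image, text, class embedding) encode into one shared latent space, reconstruction keeps each code faithful to its modality, a coupling term ties paired codes together, and a discriminator adversarially aligns the latent distributions. All module names, feature dimensions, and loss weights here are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only (PyTorch): coupling three autoencoders (image,
# text, class embedding) in one latent space, in the spirit of the abstract.
# Dimensions, names, and loss weights are assumptions.
import torch
import torch.nn as nn

LATENT = 256

def make_ae(in_dim, latent=LATENT):
    """Return a simple encoder/decoder pair for one modality."""
    enc = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(), nn.Linear(1024, latent))
    dec = nn.Sequential(nn.Linear(latent, 1024), nn.ReLU(), nn.Linear(1024, in_dim))
    return enc, dec

img_enc, img_dec = make_ae(4096)   # e.g., CNN image features (assumed dimension)
txt_enc, txt_dec = make_ae(300)    # e.g., doc2vec text features (assumed dimension)
cls_enc, cls_dec = make_ae(300)    # class (word) embeddings for knowledge transfer

# A discriminator on latent codes encourages the modality distributions to
# align, in the GAN spirit; its own update step is omitted for brevity.
disc = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(), nn.Linear(256, 1))

mse = nn.MSELoss()
bce = nn.BCEWithLogitsLoss()

def encoder_loss(img, txt, cls_emb):
    """Loss for one batch of paired (image, text, class-embedding) data."""
    z_img, z_txt, z_cls = img_enc(img), txt_enc(txt), cls_enc(cls_emb)

    # Reconstruction keeps each latent code faithful to its own modality.
    recon = (mse(img_dec(z_img), img)
             + mse(txt_dec(z_txt), txt)
             + mse(cls_dec(z_cls), cls_emb))

    # Coupling: latent codes of paired data should coincide across modalities.
    couple = mse(z_img, z_txt) + mse(z_img, z_cls) + mse(z_txt, z_cls)

    # Adversarial alignment (encoder side): push text codes toward the
    # distribution the discriminator associates with image codes.
    logits = disc(z_txt)
    adv = bce(logits, torch.ones_like(logits))

    return recon + couple + 0.1 * adv   # 0.1 is an arbitrary illustrative weight
```

A full method along these lines would also train the discriminator on image versus text codes and add a class-conditioned generator that synthesizes latent features for unseen classes; the sketch covers only the alignment part.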



Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 17, Issue 1s
January 2021
353 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3453990

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 March 2021
Accepted: 01 September 2020
Revised: 01 August 2020
Received: 01 April 2020
Published in TOMM Volume 17, Issue 1s


Author Tags

  1. Cross-modal retrieval
  2. zero-shot learning
  3. feature synthesis

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • Fundamental Research Funds for the Central Universities
  • Sichuan Science and Technology Program, China

