DOI: 10.1145/3477495.3532028

Multimodal Disentanglement Variational AutoEncoders for Zero-Shot Cross-Modal Retrieval

Published: 07 July 2022

Abstract

Zero-Shot Cross-Modal Retrieval (ZS-CMR) has recently drawn increasing attention because it targets a practical retrieval scenario: the multimodal test set consists of unseen classes that are disjoint from the seen classes in the training set. Recently proposed methods typically adopt a generative model as the main framework to learn a joint latent embedding space that alleviates the modality gap. However, these methods largely rely on auxiliary semantic embeddings for knowledge transfer across classes and overlook the effect of how data are reconstructed in the adopted generative model. To address this issue, we propose a novel ZS-CMR model termed Multimodal Disentanglement Variational AutoEncoders (MDVAE), which consists of two coupled disentanglement variational autoencoders (DVAEs) and a fusion-exchange VAE (FVAE). Specifically, the DVAEs disentangle the original representations of each modality into modality-invariant and modality-specific features. The FVAE fuses and exchanges information across the multimodal data through reconstruction and alignment, without pre-extracted semantic embeddings. Moreover, a counter-intuitive cross-reconstruction scheme is further proposed to enhance the informativeness and generalizability of the modality-invariant features for more effective knowledge transfer. Comprehensive experiments on four image-text retrieval and two image-sketch retrieval datasets consistently demonstrate that our method establishes new state-of-the-art performance.
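
To make the disentanglement and cross-reconstruction ideas above concrete, the following is a minimal PyTorch sketch of one DVAE branch and the swap-based cross-reconstruction between an image branch and a text branch. It is a hypothetical illustration, not the authors' implementation: every class name, dimension, and the MSE-based reconstruction loss are assumptions, and the KL regularizers, the FVAE, and the alignment losses are omitted.

# Hypothetical sketch of one disentanglement VAE (DVAE) branch and the
# cross-reconstruction idea described in the abstract. All names, dimensions,
# and loss choices are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DVAE(nn.Module):
    """Encodes one modality into modality-invariant (z_inv) and
    modality-specific (z_spec) latent factors, then reconstructs."""
    def __init__(self, in_dim=4096, hid_dim=1024, lat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        # Two Gaussian heads: one per latent factor.
        self.mu_inv = nn.Linear(hid_dim, lat_dim)
        self.logvar_inv = nn.Linear(hid_dim, lat_dim)
        self.mu_spec = nn.Linear(hid_dim, lat_dim)
        self.logvar_spec = nn.Linear(hid_dim, lat_dim)
        self.decoder = nn.Sequential(
            nn.Linear(2 * lat_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, in_dim))

    @staticmethod
    def reparameterize(mu, logvar):
        # Standard VAE reparameterization trick.
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def encode(self, x):
        h = self.encoder(x)
        z_inv = self.reparameterize(self.mu_inv(h), self.logvar_inv(h))
        z_spec = self.reparameterize(self.mu_spec(h), self.logvar_spec(h))
        return z_inv, z_spec

    def decode(self, z_inv, z_spec):
        return self.decoder(torch.cat([z_inv, z_spec], dim=-1))


def cross_reconstruction_loss(dvae_img, dvae_txt, x_img, x_txt):
    """Swap the modality-invariant codes across modalities and reconstruct:
    if only z_inv carries the shared semantics, each modality can still be
    reconstructed from the other modality's invariant code plus its own
    modality-specific code."""
    z_inv_i, z_spec_i = dvae_img.encode(x_img)
    z_inv_t, z_spec_t = dvae_txt.encode(x_txt)
    rec_img = dvae_img.decode(z_inv_t, z_spec_i)  # text-invariant + image-specific
    rec_txt = dvae_txt.decode(z_inv_i, z_spec_t)  # image-invariant + text-specific
    return F.mse_loss(rec_img, x_img) + F.mse_loss(rec_txt, x_txt)

The design point the sketch captures is that only the modality-invariant code is exchanged between the two branches, so reconstructing each modality after the swap pressures the shared semantics into z_inv, which is what makes the invariant features useful for retrieval across unseen classes.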

Supplementary Material

MP4 File (SIGIR22-fp0210.mp4)
Presentation video



Published In

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2022
3569 pages
ISBN:9781450387323
DOI:10.1145/3477495

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. cross-modal retrieval
  2. disentanglement
  3. reconstruction
  4. variational autoencoder
  5. zero-shot learning

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • Meituan
  • Sichuan Science and Technology Program

Conference

SIGIR '22

Acceptance Rates

Overall acceptance rate: 792 of 3,983 submissions (20%)


Article Metrics

  • Downloads (last 12 months): 263
  • Downloads (last 6 weeks): 47
Reflects downloads up to 28 Feb 2025


Cited By

  • (2025) Domain disentanglement and fusion based on hyperbolic neural networks for zero-shot sketch-based image retrieval. Information Processing & Management 62:1 (103963). DOI: 10.1016/j.ipm.2024.103963. Online publication date: Jan 2025.
  • (2024) Unsupervised Cross-Domain Image Retrieval with Semantic-Attended Mixture-of-Experts. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 197-207. DOI: 10.1145/3626772.3657826. Online publication date: 10 Jul 2024.
  • (2024) CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2188-2198. DOI: 10.1145/3626772.3657741. Online publication date: 10 Jul 2024.
  • (2024) Disentangled Graph Variational Auto-Encoder for Multimodal Recommendation With Interpretability. IEEE Transactions on Multimedia 26, 7543-7554. DOI: 10.1109/TMM.2024.3369875. Online publication date: 28 Feb 2024.
  • (2024) Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions. Proceedings of the IEEE 112:11, 1716-1754. DOI: 10.1109/JPROC.2024.3525147. Online publication date: Nov 2024.
  • (2024) Cross-modal retrieval based on Attention Embedded Variational Auto-Encoder. 2024 International Joint Conference on Neural Networks (IJCNN), 1-7. DOI: 10.1109/IJCNN60899.2024.10651513. Online publication date: 30 Jun 2024.
  • (2024) A Systematic Literature Review of Deep Learning Approaches for Sketch-Based Image Retrieval: Datasets, Metrics, and Future Directions. IEEE Access 12, 14847-14869. DOI: 10.1109/ACCESS.2024.3357939. Online publication date: 2024.
  • (2024) A novel attention-based cross-modal transfer learning framework for predicting cardiovascular disease. Computers in Biology and Medicine 170:C. DOI: 10.1016/j.compbiomed.2024.107977. Online publication date: 25 Jun 2024.
  • (2024) Disentangled Variational Autoencoder for Social Recommendation. Neural Processing Letters 56:3. DOI: 10.1007/s11063-024-11607-y. Online publication date: 29 Apr 2024.
  • (2023) Interpretability for reliable, efficient, and self-cognitive DNNs. Neurocomputing 545:C. DOI: 10.1016/j.neucom.2023.126267. Online publication date: 7 Aug 2023.
