DOI: 10.1145/3477495.3532061

Structure-Aware Semantic-Aligned Network for Universal Cross-Domain Retrieval

Published: 07 July 2022

Abstract

The goal of cross-domain retrieval (CDR) is to search for instances of the same category in one domain using a query from another domain. Existing CDR approaches mainly consider the standard scenario in which the cross-domain data for both training and testing come from the same categories and underlying distributions. However, these methods do not extend well to the newly emerging task of universal cross-domain retrieval (UCDR), where the test data belong to domains and categories not present during training. Compared to CDR, the UCDR task is more challenging due to (1) visually diverse data from multi-source domains, (2) the domain shift between seen and unseen domains, and (3) the semantic shift between seen and unseen categories. To tackle these problems, we propose a novel model termed Structure-Aware Semantic-Aligned Network (SASA) that aligns the heterogeneous representations of multi-source domains without loss of generalizability for the UCDR task. Specifically, we adopt the Vision Transformer (ViT) as the backbone and devise a distillation-alignment ViT (DAViT) with a novel token-based strategy, which incorporates two complementary distillation and alignment tokens into the ViT architecture. The distillation token is designed to improve the generalizability of our model by preserving structural information, while the alignment token improves discriminativeness through trainable categorical prototypes. Extensive experiments on three large-scale benchmarks, i.e., Sketchy, TU-Berlin, and DomainNet, demonstrate the superiority of our SASA method over state-of-the-art UCDR and ZS-SBIR methods.
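To make the token-based design more concrete, below is a minimal PyTorch-style sketch (not the authors' implementation) of a ViT encoder augmented with distillation and alignment tokens, paired with a structure-preserving distillation loss against a frozen teacher and a prototype-based alignment loss. All class names, dimensions, and loss forms here (e.g., TokenAugmentedViT, num_prototypes) are illustrative assumptions rather than details taken from the paper.

```python
# Hypothetical sketch of a ViT encoder with extra distillation/alignment tokens,
# loosely following the DAViT idea described in the abstract. Names, dimensions,
# and loss forms are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenAugmentedViT(nn.Module):
    def __init__(self, embed_dim=768, depth=12, num_heads=12,
                 num_patches=196, num_prototypes=100):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # structure distillation
        self.align_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # prototype alignment
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 3, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Trainable categorical prototypes, one per seen category (assumed count).
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, embed_dim))

    def forward(self, patch_embeddings):
        # patch_embeddings: (batch, num_patches, embed_dim), already projected.
        b = patch_embeddings.size(0)
        tokens = torch.cat([self.cls_token.expand(b, -1, -1),
                            self.dist_token.expand(b, -1, -1),
                            self.align_token.expand(b, -1, -1),
                            patch_embeddings], dim=1) + self.pos_embed
        out = self.encoder(tokens)
        return out[:, 0], out[:, 1], out[:, 2]  # cls, distillation, alignment features


def structure_distillation_loss(student_feat, teacher_feat):
    """Match the pairwise cosine-similarity structure of a frozen teacher batch."""
    s = F.normalize(student_feat, dim=-1) @ F.normalize(student_feat, dim=-1).T
    t = F.normalize(teacher_feat, dim=-1) @ F.normalize(teacher_feat, dim=-1).T
    return F.mse_loss(s, t)


def prototype_alignment_loss(align_feat, prototypes, labels, temperature=0.1):
    """Classify the alignment feature against the trainable categorical prototypes."""
    logits = F.normalize(align_feat, dim=-1) @ F.normalize(prototypes, dim=-1).T
    return F.cross_entropy(logits / temperature, labels)
```

In a training loop, these two losses would typically be combined with a standard classification loss on the class token, and at retrieval time the class or alignment feature would serve as the shared cross-domain embedding; consult the paper itself for the exact objective and inference procedure.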

Supplementary Material

MP4 File (SIGIR2022-fp0257.mp4)
Presentation video


Cited By

  • (2024) Retrieval across any domains via large-scale pre-trained model. Proceedings of the 41st International Conference on Machine Learning, 10.5555/3692070.3694374, 55901-55912. Online publication date: 21-Jul-2024
  • (2024) ProS: Prompting-to-Simulate Generalized Knowledge for Universal Cross-Domain Retrieval. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10.1109/CVPR52733.2024.01637, 17292-17301. Online publication date: 16-Jun-2024
  • (2024) A Systematic Literature Review of Deep Learning Approaches for Sketch-Based Image Retrieval: Datasets, Metrics, and Future Directions. IEEE Access, vol. 12, 10.1109/ACCESS.2024.3357939, 14847-14869. Online publication date: 2024


Published In

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2022
3569 pages
ISBN:9781450387323
DOI:10.1145/3477495


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 July 2022


Author Tags

  1. alignment
  2. knowledge distillation
  3. universal cross-domain retrieval
  4. zero-shot learning

Qualifiers

  • Research-article


Conference

SIGIR '22

Acceptance Rates

Overall acceptance rate: 792 of 3,983 submissions (20%)



Bibliometrics & Citations


Article Metrics

  • Downloads (last 12 months): 86
  • Downloads (last 6 weeks): 17
Reflects downloads up to 28 Feb 2025

