DOI: 10.1145/3581783.3611949
research-article

Learning Style-Invariant Robust Representation for Generalizable Visual Instance Retrieval

Published: 27 October 2023

Abstract

Visual Instance Retrieval (VIR) is an active research topic because of its wide application in the real world, such as object re-identification in smart city scenarios. However, due to the limited style diversity in source training data, most existing VIR models fail to generalize well to unseen domains. How to improve the generalizability of VIR models has received increasing attention in recent years.
In this paper, we focus on the Single Domain Generalization (SDG) based VIR task, a more challenging but practical problem in which the model is trained on data from a single domain and evaluated directly on an unseen target domain without any fine-tuning or adaptation. In this setting, the limited style variance in the training data may lead the model to rely incorrectly on superficial style features, which reduces its generalizability. To address this issue, we propose a novel Style-Invariant robust Representation Learning (SIRL) method, which first diversifies the training data with style augmentation and then forces the model to learn style-invariant features. Specifically, we first design an adversarial style synthesis module that learns to synthesize diverse augmented samples with adversarially learned styles. Then, we devise an invariant feature learning module that minimizes the cross-domain feature inconsistency between source images and style-augmented images in order to capture domain-invariant instance features. In this way, we prevent the model from over-exploiting cues that are independent of semantic content (e.g., color) as shortcut features, so that it estimates pairwise instance similarity more robustly. We integrate our SIRL method with SOTA VIR networks and evaluate its effectiveness on several public benchmark datasets. Extensive experiments show that the SIRL method can substantially improve the generalizability of existing VIR networks in the challenging SDG-VIR setting.
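
The two-stage idea in the abstract (synthesize hard styles, then penalize feature inconsistency between source and style-augmented inputs) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes that "style" means per-channel mean and standard deviation, it replaces the learned adversarial module with a gradient-free random search, and every name in it (`restyle`, `hardest_style`, `consistency_loss`, the two toy encoders) is hypothetical.

```python
import random
from statistics import fmean, pstdev

EPS = 1e-6  # keeps the division stable for near-constant channels


def channel_stats(channel):
    """Return (mean, std) of one channel; assumed here to represent its style."""
    return fmean(channel), pstdev(channel) + EPS


def restyle(features, new_means, new_stds):
    """Normalize each channel, then re-scale it with new style statistics."""
    out = []
    for channel, new_mean, new_std in zip(features, new_means, new_stds):
        mean, std = channel_stats(channel)
        out.append([new_std * (v - mean) / std + new_mean for v in channel])
    return out


def raw_embed(features):
    """Toy style-sensitive encoder: flatten the channels unchanged."""
    return [v for channel in features for v in channel]


def normalized_embed(features):
    """Toy style-invariant encoder: drop channel mean and std before flattening."""
    out = []
    for channel in features:
        mean, std = channel_stats(channel)
        out.extend((v - mean) / std for v in channel)
    return out


def consistency_loss(embed_fn, source, augmented):
    """Mean squared difference between source and style-augmented embeddings."""
    a, b = embed_fn(source), embed_fn(augmented)
    return fmean([(x - y) ** 2 for x, y in zip(a, b)])


def hardest_style(embed_fn, source, trials=50, seed=0):
    """Random-search stand-in for adversarial style learning.

    Samples candidate styles and keeps the one that maximizes the
    inconsistency, i.e. the style that is hardest for this encoder.
    """
    rng = random.Random(seed)
    best_loss, best_style = -1.0, None
    for _ in range(trials):
        means = [rng.uniform(-2.0, 2.0) for _ in source]
        stds = [rng.uniform(0.5, 2.0) for _ in source]
        loss = consistency_loss(embed_fn, source, restyle(source, means, stds))
        if loss > best_loss:
            best_loss, best_style = loss, (means, stds)
    return best_loss, best_style


# Two feature channels of one hypothetical image.
source = [[0.1, 0.5, 0.9, 0.3], [1.0, 2.0, 4.0, 3.0]]
raw_loss, _ = hardest_style(raw_embed, source)
inv_loss, _ = hardest_style(normalized_embed, source)
print(f"style-sensitive encoder, worst-case inconsistency: {raw_loss:.4f}")
print(f"style-invariant encoder, worst-case inconsistency: {inv_loss:.2e}")
```

In this sketch the style-sensitive encoder incurs a large worst-case inconsistency, while the encoder that discards channel statistics is essentially unaffected by the hardest sampled style. In a trainable model, the consistency term would be minimized with respect to the encoder parameters rather than satisfied by construction.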

Supplemental Material

MP4 File
In this work, we have proposed an effective method for the Single Domain Generalization (SDG) Visual Instance Retrieval (VIR) task. Our method is plug-and-play and can be easily integrated with any baseline model. Extensive experimental results show that our method consistently advances the learning of robust style-invariant features and substantially improves the generalization performance of VIR networks on unseen target-domain datasets.

Cited By

  • (2024) Cross-Lingual Cross-Modal Retrieval With Noise-Robust Fine-Tuning. IEEE Transactions on Knowledge and Data Engineering 36:11 (5860-5873). DOI: 10.1109/TKDE.2024.3400060. Online publication date: 1-Nov-2024
  • (2024) Learning Hierarchical Visual Transformation for Domain Generalizable Visual Matching and Recognition. International Journal of Computer Vision 132:11 (4823-4849). DOI: 10.1007/s11263-024-02106-7. Online publication date: 27-May-2024
  • (2024) A Robust Person Shape Representation via Grassmann Channel Pooling. Pattern Recognition (455-474). DOI: 10.1007/978-3-031-78186-5_30. Online publication date: 30-Nov-2024

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. domain generalization
    2. object reidentification
    3. style synthesis
    4. visual instance retrieval

    Qualifiers

    • Research-article

    Conference

    MM '23
    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 146
    • Downloads (Last 6 weeks): 12
    Reflects downloads up to 05 Mar 2025
