DOI: 10.1145/3581783.3611916

A Baseline Investigation: Transformer-based Cross-view Baseline for Text-based Person Search

Published: 27 October 2023

Abstract

This paper investigates a baseline approach for text-based person search built on a transformer-based framework. Existing methods usually treat visual and textual features as independent entities to speed up model inference. However, the attention paid to the same image should change with the query text. We therefore adopt a commonly used framework with a fused feature as the baseline, which overcomes the misalignment problem introduced by fixed features, and investigate it thoroughly. Moreover, we propose Cross-View Matching (CVM) to provide challenging positive text-image pairs that enable the model to learn cross-view meta-information. Furthermore, we suggest a novel evaluation process that reduces inference time and GPU memory demand. Experiments are conducted on the CUHK-PEDES, ICFG-PEDES, and RSTPReid benchmarks. Through extensive parameter analysis, the potential of a transformer-based framework is fully explored. Although the proposed scheme is a simple framework, it achieves significant performance improvements over other state-of-the-art methods.
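
The abstract does not spell out how Cross-View Matching builds its positive pairs. One plausible reading is that a caption written for one image of a person is paired with a different image of the same person, so the positive pair spans two camera views. The sketch below illustrates only that reading. The function name `build_cross_view_pairs` and the `(person_id, image_id, caption)` sample layout are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def build_cross_view_pairs(samples, seed=0):
    """Pair each caption with a different image of the same person.

    `samples` is a list of (person_id, image_id, caption) tuples.
    This layout is an assumption made for illustration only.
    Returns a list of (caption, image_id, person_id) tuples.
    """
    rng = random.Random(seed)

    # Collect the distinct images available for every identity.
    images_by_person = defaultdict(list)
    for person_id, image_id, _ in samples:
        if image_id not in images_by_person[person_id]:
            images_by_person[person_id].append(image_id)

    pairs = []
    for person_id, image_id, caption in samples:
        # Candidate images: same person, but not the captioned image.
        others = [i for i in images_by_person[person_id] if i != image_id]
        if others:  # identities with a single image yield no pair
            pairs.append((caption, rng.choice(others), person_id))
    return pairs


if __name__ == "__main__":
    demo = [
        (1, "p1_cam_a.jpg", "a man in a red jacket"),
        (1, "p1_cam_b.jpg", "a man with a black backpack"),
        (2, "p2_cam_a.jpg", "a woman in a white dress"),
    ]
    for caption, image_id, person_id in build_cross_view_pairs(demo):
        print(person_id, image_id, caption)
```

Pairs built this way are harder than same-image pairs because the description may mention details that are occluded or look different in the other view, which fits the abstract's description of "challenging" positives.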

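Cited By

  • (2024) Prototypical Prompting for Text-to-image Person Re-identification. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 2331-2340. DOI: 10.1145/3664647.3681165. Online publication date: 28-Oct-2024
  • (2024) Open-Source Projects for 3D Point Clouds. Deep Learning for 3D Point Clouds, pp. 255-272. DOI: 10.1007/978-981-97-9570-3_9. Online publication date: 10-Oct-2024
  • (2024) Point Cloud-Language Multi-modal Learning. Deep Learning for 3D Point Clouds, pp. 227-254. DOI: 10.1007/978-981-97-9570-3_8. Online publication date: 10-Oct-2024
  • (2024) Point Cloud Pre-trained Models and Large Models. Deep Learning for 3D Point Clouds, pp. 195-225. DOI: 10.1007/978-981-97-9570-3_7. Online publication date: 10-Oct-2024
  • (2024) Deep-Learning-Based Point Cloud Analysis II. Deep Learning for 3D Point Clouds, pp. 163-193. DOI: 10.1007/978-981-97-9570-3_6. Online publication date: 10-Oct-2024
  • (2024) Deep-Learning-Based Point Cloud Analysis I. Deep Learning for 3D Point Clouds, pp. 131-162. DOI: 10.1007/978-981-97-9570-3_5. Online publication date: 10-Oct-2024
  • (2024) Deep-Learning-Based Point Cloud Enhancement II. Deep Learning for 3D Point Clouds, pp. 99-130. DOI: 10.1007/978-981-97-9570-3_4. Online publication date: 10-Oct-2024
  • (2024) Deep-Learning-based Point Cloud Enhancement I. Deep Learning for 3D Point Clouds, pp. 71-97. DOI: 10.1007/978-981-97-9570-3_3. Online publication date: 10-Oct-2024
  • (2024) Learning Basics for 3D Point Clouds. Deep Learning for 3D Point Clouds, pp. 29-70. DOI: 10.1007/978-981-97-9570-3_2. Online publication date: 10-Oct-2024
  • (2024) Future Work on Deep Learning-Based Point Cloud Technologies. Deep Learning for 3D Point Clouds, pp. 301-315. DOI: 10.1007/978-981-97-9570-3_11. Online publication date: 10-Oct-2024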

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cross-view matching
    2. feature fusion
    3. text-based person search

    Qualifiers

    • Research-article

    Conference

    MM '23
    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
