research-article

A Baseline Investigation: Transformer-based Cross-view Baseline for Text-based Person Search

Authors:

Hao Sun

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 7737 - 7746

https://doi.org/10.1145/3581783.3611916

Published: 27 October 2023

Abstract

This paper investigates a transformer-based baseline for text-based person search. Existing methods usually treat visual and textual features as independent entities in order to speed up inference. However, the attention paid to a given image should change with the text it is matched against. We therefore adopt a commonly used framework with a fused feature as the baseline, which overcomes the misalignment problem introduced by fixed features, and investigate it thoroughly. We also propose Cross-View Matching (CVM), which provides challenging positive text-image pairs that enable the model to learn cross-view meta-information, and we suggest a novel evaluation process that reduces inference time and GPU memory demand. Experiments are conducted on the CUHK-PEDES, ICFG-PEDES, and RSTPReid benchmarks, and extensive parameter analysis explores the potential of the transformer-based framework. Although the proposed scheme is simple, it achieves significant performance improvements over other state-of-the-art methods.
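
The abstract does not spell out the evaluation process, so the following is only a generic illustration of the trade-off it describes, not the authors' procedure. A fused text-image scorer must be run once per text-image pair, whereas independent features can be compared with a cheap similarity. One common way to limit the cost is to shortlist candidates with the cheap similarity and re-rank only the shortlist with the fused scorer. The names `cosine`, `two_stage_search`, `fused_score`, and `k` are hypothetical and chosen for this sketch.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two independently computed feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_search(
    text_vec: Vector,
    image_vecs: Sequence[Vector],
    fused_score: Callable[[int], float],
    k: int,
) -> List[int]:
    """Rank gallery indices for one text query.

    Stage 1 ranks every image with cheap independent features.
    Stage 2 calls the expensive fused scorer (assumed to take a gallery
    index and return a match score) only on the top-k shortlist.
    """
    coarse = sorted(
        range(len(image_vecs)),
        key=lambda i: cosine(text_vec, image_vecs[i]),
        reverse=True,
    )
    shortlist = coarse[:k]
    reranked = sorted(shortlist, key=fused_score, reverse=True)
    # Images outside the shortlist keep their coarse order.
    return reranked + coarse[k:]


# Toy data: three gallery images and a made-up fused scorer.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_fused = {0: 0.2, 1: 0.9, 2: 0.99}
print(two_stage_search([1.0, 0.0], gallery, toy_fused.__getitem__, k=2))
# → [1, 0, 2]
```

With `k=2` the fused scorer runs on two images instead of three. The toy scores also show the cost of the shortcut: image 2 has the highest fused score but never reaches the second stage, so the choice of `k` trades accuracy against inference time.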

Index Terms

A Baseline Investigation: Transformer-based Cross-view Baseline for Text-based Person Search
1. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval
        Image search

Information & Contributors

Information

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

9913 pages

ISBN:9798400701085

DOI:10.1145/3581783

General Chairs:
Abdulmotaleb El Saddik (University of Ottawa, Canada & MBZUAI, UAE)
Tao Mei (HiDream.ai, China)
Rita Cucchiara (University of Modena and Reggio Emilia, Italy)

Program Chairs:
Marco Bertini (University of Florence, Italy)
Diana Patricia Tobon Vallejo (Universidad de Medellin, Colombia)
Pradeep K. Atrey (University at Albany, State University of New York, USA)
M. Shamim Hossain (King Saud University, KSA)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2023

Qualifiers

Research-article

Conference

MM '23

Sponsor:

SIGMM

MM '23: The 31st ACM International Conference on Multimedia

October 29 - November 3, 2023

Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics

Total Citations: 12
Total Downloads: 199
Downloads (Last 12 months): 104
Downloads (Last 6 weeks): 7

Reflects downloads up to 14 Feb 2025
