research-article

Text-Guided Visual Feature Refinement for Text-Based Person Search

Authors:

Bingliang Jiao,

Peng Wang

ICMR '21: Proceedings of the 2021 International Conference on Multimedia Retrieval

Pages 118 - 126

https://doi.org/10.1145/3460426.3463652

Published: 01 September 2021

Abstract

Text-based person search is the task of retrieving images of a person from a large-scale image database given a textual description, and it is valuable in applications such as video surveillance. At inference time, language descriptions serve as queries that guide the search for the corresponding person images. Most existing methods apply cross-modal signals to guide feature refinement. However, they employ visual features from the gallery to refine textual features, which may cause high similarity between unmatched pairs. Moreover, similarity-based cross-modal attention can disturb the selection of the regions that a description refers to. In this paper, we analyze the deficiencies of previous methods and design a Text-guided Visual Feature Refinement network (TVFR), which uses text as the reference to refine visual representations. First, we divide each visual feature into several horizontal stripes for fine-grained refinement. Next, a text-based filter generation module produces description-customized filters, which indicate the stripes mentioned in the textual input. Finally, a text-guided visual feature refinement module adaptively fuses the part-level visual features for each description. We validate TVFR through extensive experiments on CUHK-PEDES, which is the only available dataset for text-based person search. To the best of our knowledge, TVFR outperforms other state-of-the-art methods on this dataset.
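The three steps named in the abstract (horizontal stripes, a description-specific filter, adaptive fusion of part-level features) can be pictured with a minimal sketch in plain Python. This is not the authors' implementation: the function names, the single linear projection used as the filter generator, and the softmax weighting over stripes are assumptions made for illustration, since the abstract does not specify these details.

```python
import math


def split_into_stripes(feature_rows, num_stripes):
    """Average-pool an H x C feature map (a list of H rows) into num_stripes part vectors."""
    height = len(feature_rows)
    if height % num_stripes != 0:
        raise ValueError("height must be divisible by num_stripes in this sketch")
    rows_per_stripe = height // num_stripes
    channels = len(feature_rows[0])
    stripes = []
    for s in range(num_stripes):
        block = feature_rows[s * rows_per_stripe:(s + 1) * rows_per_stripe]
        stripes.append(
            [sum(row[c] for row in block) / rows_per_stripe for c in range(channels)]
        )
    return stripes


def generate_filter(text_embedding, projection):
    """Hypothetical filter generator: one linear map from text space to visual channels."""
    return [
        sum(w * t for w, t in zip(weight_row, text_embedding))
        for weight_row in projection
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def refine_visual_feature(stripes, text_filter):
    """Score each stripe with the text-derived filter, then fuse stripes by those weights."""
    scores = [sum(f * v for f, v in zip(text_filter, stripe)) for stripe in stripes]
    weights = softmax(scores)
    channels = len(stripes[0])
    refined = [
        sum(w * stripe[c] for w, stripe in zip(weights, stripes))
        for c in range(channels)
    ]
    return refined, weights


if __name__ == "__main__":
    # Toy 4 x 2 feature map: the upper half activates channel 0, the lower half channel 1.
    feature_map = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    stripes = split_into_stripes(feature_map, num_stripes=2)
    # Assumed text embedding and identity projection, chosen so the filter favors channel 1.
    text_filter = generate_filter([0.0, 2.0], [[1.0, 0.0], [0.0, 1.0]])
    refined, weights = refine_visual_feature(stripes, text_filter)
    print("stripe weights:", weights)
    print("refined feature:", refined)
```

In this toy setting the description-derived filter assigns more weight to the lower stripe, so the fused feature is dominated by the part the text is assumed to mention. The visual feature is refined per description while the text feature itself is left unchanged, which is the direction of guidance the abstract argues for.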

Index Terms

Text-Guided Visual Feature Refinement for Text-Based Person Search
1. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval
        Image search

Information & Contributors

Information

Published In

ICMR '21: Proceedings of the 2021 International Conference on Multimedia Retrieval

August 2021

715 pages

ISBN:9781450384636

DOI:10.1145/3460426

General Chairs:
Wen-Huang Cheng (National Yang Ming Chiao Tung University, Taiwan),
Mohan Kankanhalli (National University of Singapore, Singapore),
Meng Wang (Hefei University of Technology, China)

Program Chairs:
Wei-Ta Chu (National Cheng Kung University, Taiwan),
Jiaying Liu (Peking University, China),
Marcel Worring (University of Amsterdam, Netherlands)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

the National Natural Science Foundation of China
the Ministry of Science and Technology of China
the Open Projects Program of National Laboratory of Pattern Recognition

Conference

ICMR '21

Sponsor:

SIGMM

ICMR '21: International Conference on Multimedia Retrieval

August 21 - 24, 2021

Taipei, Taiwan

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%

Bibliometrics & Citations

Article Metrics

Total Citations: 4
Total Downloads: 285
Downloads (Last 12 months): 35
Downloads (Last 6 weeks): 5

Reflects downloads up to 16 Feb 2025

Citations

Cited By

Niu K, Liu Y, Long Y, Huang Y, Wang L, Zhang Y. (2024). An Overview of Text-Based Person Search: Recent Advances and Future Directions. IEEE Transactions on Circuits and Systems for Video Technology 34:9 (7803-7819). DOI: 10.1109/TCSVT.2024.3376373. Online publication date: Sep-2024.
https://doi.org/10.1109/TCSVT.2024.3376373
Irene S, John Prakash A, Rhymend Uthariaraj V. (2024). Person search over security video surveillance systems using deep learning methods. Image and Vision Computing 143:C. DOI: 10.1016/j.imavis.2024.104930. Online publication date: 2-Jul-2024.
https://dl.acm.org/doi/10.1016/j.imavis.2024.104930
Niu K, Huang T, Huang L, Wang L, Zhang Y. (2023). Improving Inconspicuous Attributes Modeling for Person Search by Language. IEEE Transactions on Image Processing 32 (3429-3441). DOI: 10.1109/TIP.2023.3285426. Online publication date: 1-Jan-2023.
https://dl.acm.org/doi/10.1109/TIP.2023.3285426
Gao L, Niu K, Jiao B, Wang P, Zhang Y. (2023). Addressing Information Inequality for Text-Based Person Search via Pedestrian-Centric Visual Denoising and Bias-Aware Alignments. IEEE Transactions on Circuits and Systems for Video Technology 33:12 (7884-7899). DOI: 10.1109/TCSVT.2023.3273719. Online publication date: 8-May-2023.
https://dl.acm.org/doi/10.1109/TCSVT.2023.3273719
Liu Q, He X, Teng Q, Qing L, Chen H. (2023). BDNet. Pattern Recognition 141:C. DOI: 10.1016/j.patcog.2023.109636. Online publication date: 1-Sep-2023.
https://dl.acm.org/doi/10.1016/j.patcog.2023.109636
