DOI: 10.1145/3460426.3463652
research-article

Text-Guided Visual Feature Refinement for Text-Based Person Search

Published: 01 September 2021

Abstract

Text-based person search is the task of retrieving the person matching a textual description from a large-scale image database, and it has important value in fields such as video surveillance. At inference time, language descriptions serve as queries that guide the search for the corresponding person images. Most existing methods apply cross-modal signals to guide feature refinement. However, they use visual features from the gallery to refine textual features, which may cause high similarity between unmatched pairs. In addition, similarity-based cross-modal attention can disturb the selection of the regions of interest for a description. In this paper, we analyze the deficiencies of previous methods and design a Text-guided Visual Feature Refinement network (TVFR), which uses text as a reference to refine visual representations. First, we divide each visual feature into several horizontal stripes for fine-grained refinement. Next, a text-based filter generation module generates description-customized filters, which indicate the stripes mentioned in the textual input. Finally, a text-guided visual feature refinement module adaptively fuses the part-level visual features for each description. We validate TVFR through extensive experiments on CUHK-PEDES, which is the only available dataset for text-based person search. To the best of our knowledge, TVFR outperforms other state-of-the-art methods.
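
The three steps named in the abstract (horizontal stripes, a text-generated filter, adaptive fusion of part-level features) can be illustrated with a small dependency-free sketch. This is not the authors' implementation: the abstract does not specify the modules, so the average pooling, the single linear map used as a filter generator, the dot-product scoring, and the softmax weighting are all assumptions, and the names `split_into_stripes`, `generate_filter`, and `refine` are hypothetical.

```python
import math


def split_into_stripes(feature_map, num_stripes):
    """Average-pool a C x H x W nested list into horizontal stripes.

    Returns num_stripes vectors of length C (one part-level feature per stripe).
    Average pooling is an assumption; the abstract only says "horizontal stripes".
    """
    height = len(feature_map[0])
    if height % num_stripes != 0:
        raise ValueError("height must be divisible by num_stripes")
    rows = height // num_stripes
    stripes = []
    for k in range(num_stripes):
        vec = []
        for channel in feature_map:
            cells = [v for row in channel[k * rows:(k + 1) * rows] for v in row]
            vec.append(sum(cells) / len(cells))
        stripes.append(vec)
    return stripes


def generate_filter(text_feature, weight):
    """Hypothetical filter generator: a linear map from text (D) to a filter (C)."""
    return [sum(w * t for w, t in zip(row, text_feature)) for row in weight]


def refine(stripes, text_filter):
    """Score each stripe with the text filter and fuse stripes by softmax weight."""
    scores = [sum(s * f for s, f in zip(stripe, text_filter)) for stripe in stripes]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    channels = len(stripes[0])
    fused = [
        sum(w * stripe[c] for w, stripe in zip(weights, stripes))
        for c in range(channels)
    ]
    return fused, weights


# Toy input: 2 channels, height 4, width 1, split into 2 stripes.
feature_map = [
    [[1.0], [1.0], [0.0], [0.0]],  # channel 0 is active in the upper stripe
    [[0.0], [0.0], [2.0], [2.0]],  # channel 1 is active in the lower stripe
]
text_feature = [1.0, 0.0]
weight = [[0.0, 0.0], [3.0, 0.0]]  # made-up weights that favor channel 1

stripes = split_into_stripes(feature_map, num_stripes=2)
text_filter = generate_filter(text_feature, weight)
fused, stripe_weights = refine(stripes, text_filter)
```

In this toy case the generated filter responds to channel 1, so the lower stripe receives most of the fusion weight. A different description would produce a different filter and therefore a different weighting of the same image stripes, which is the sense in which the text guides the visual refinement.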

    Published In

    ICMR '21: Proceedings of the 2021 International Conference on Multimedia Retrieval
    August 2021
    715 pages
    ISBN:9781450384636
    DOI:10.1145/3460426
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cross-modal retrieval
    2. text-based person search
    3. text-guided visual feature refinement

    Qualifiers

    • Research-article

    Funding Sources

    • the National Natural Science Foundation of China
    • the Ministry of Science and Technology of China
    • the Open Projects Program of National Laboratory of Pattern Recognition

    Conference

    ICMR '21
    Acceptance Rates

    Overall Acceptance Rate 254 of 830 submissions, 31%
