Research article
DOI: 10.1145/3503161.3547753

Cross-modal Co-occurrence Attributes Alignments for Person Search by Language

Published: 10 October 2022

Abstract

Person search by language aims to retrieve pedestrian images of interest from a free-form natural language description, and has important applications in smart video surveillance. Although great effort has gone into aligning images with sentences, reporting bias, i.e., attributes that are only partially matched across modalities, still introduces substantial noise and seriously degrades retrieval accuracy. To address this challenge, we propose a novel cross-modal matching method named Cross-modal Co-occurrence Attributes Alignments (C2A2), which handles this noise better and obtains significant improvements in retrieval performance for person search by language. First, we construct visual and textual attribute dictionaries based on matrix decomposition, and carry out cross-modal alignments using denoising reconstruction features to address the noise from pedestrian-unrelated elements. Second, we re-gather the pixels of an image and the words of a sentence under the guidance of the learned attribute dictionaries, to adaptively form more discriminative co-occurrence attributes in both modalities. The re-gathered co-occurrence attributes are then captured by imposing explicit cross-modal one-to-one alignments that consider relations across modalities, which better alleviates the noise from non-corresponding attributes. The whole C2A2 method can be trained end-to-end without any pre-processing, i.e., it requires negligible additional computation overhead. It significantly outperforms existing solutions and achieves new state-of-the-art retrieval performance on two large-scale benchmarks, the CUHK-PEDES and RSTPReid datasets.
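
To make the two steps in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes non-negative matrix factorization with multiplicative updates as the matrix decomposition, treats the low-rank product as the "denoising reconstruction", pools local features with normalized codes as a stand-in for "re-gathering", and scores same-index attributes with cosine similarity as a stand-in for one-to-one alignment. All function names, shapes, and the toy data are hypothetical. In a trained model the correspondence between attribute indices across modalities would be learned; here the score only illustrates the computation.

```python
import numpy as np

EPS = 1e-9  # small constant to avoid division by zero


def learn_dictionary(X, k, iters=200, seed=0):
    """Factor a non-negative matrix X (d x n) as D @ C.

    D (d x k) plays the role of an attribute dictionary and C (k x n)
    holds the codes of the n local features (assumption: NMF with
    multiplicative updates is used as the decomposition).
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    D = rng.random((d, k)) + EPS
    C = rng.random((k, n)) + EPS
    for _ in range(iters):
        C *= (D.T @ X) / (D.T @ D @ C + EPS)   # update codes
        D *= (X @ C.T) / (D @ C @ C.T + EPS)   # update dictionary
    return D, C


def regather(X, C):
    """Pool local features into k attribute vectors, weighted by the codes."""
    W = C / (C.sum(axis=1, keepdims=True) + EPS)  # each row sums to about 1
    return X @ W.T                                 # shape (d, k)


def one_to_one_score(A_img, A_txt):
    """Mean cosine similarity between same-index attribute columns."""
    a = A_img / (np.linalg.norm(A_img, axis=0, keepdims=True) + EPS)
    b = A_txt / (np.linalg.norm(A_txt, axis=0, keepdims=True) + EPS)
    return float((a * b).sum(axis=0).mean())


# Toy data (hypothetical): 16-dim features, already in a shared space.
rng = np.random.default_rng(1)
X_img = rng.random((16, 3)) @ rng.random((3, 40))  # 40 "pixel" features
X_txt = rng.random((16, 3)) @ rng.random((3, 12))  # 12 "word" features

D_img, C_img = learn_dictionary(X_img, k=3)
D_txt, C_txt = learn_dictionary(X_txt, k=3)

X_img_denoised = D_img @ C_img  # low-rank reconstruction of image features
score = one_to_one_score(regather(X_img, C_img), regather(X_txt, C_txt))
```

The sketch keeps one dictionary per modality and never compares attributes with different indices, which is the sense in which the alignment is one-to-one here.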

Supplementary Material

MP4 File (MM22-fp0046.mp4)
Person search by language aims to retrieve pedestrian images of interest from a free-form natural language sentence. To address the challenge of reporting bias, we propose a novel method named Cross-modal Co-occurrence Attributes Alignments. First, we construct visual and textual attribute dictionaries based on matrix decomposition, and carry out cross-modal alignments using denoising reconstruction features to address the noise from pedestrian-unrelated elements. Second, we re-gather the pixels of an image and the words of a sentence under the guidance of the learned attribute dictionaries, to adaptively form more discriminative co-occurrence attributes in both modalities, which better alleviates the noise from non-corresponding attributes by considering relations across modalities. The proposed method significantly outperforms existing solutions and achieves new state-of-the-art retrieval performance on two large-scale benchmarks, the CUHK-PEDES and RSTPReid datasets.


    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN:9781450392037
    DOI:10.1145/3503161
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cross-modal retrieval
    2. matrix decomposition
    3. person search by language

    Qualifiers

    • Research-article

    Conference

    MM '22
    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (last 12 months): 74
    • Downloads (last 6 weeks): 4
    Reflects downloads up to 16 Feb 2025

    Cited By

    • (2025) From attributes to natural language: A survey and foresight on text-based person re-identification. Information Fusion 118 (102879). DOI: 10.1016/j.inffus.2024.102879. Online publication date: Jun-2025
    • (2024) Cross-modal generation and alignment via attribute-guided prompt for unsupervised text-based person retrieval. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (1047-1055). DOI: 10.24963/ijcai.2024/116. Online publication date: 3-Aug-2024
    • (2024) Accurate and Lightweight Learning for Specific Domain Image-Text Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia (9719-9728). DOI: 10.1145/3664647.3681280. Online publication date: 28-Oct-2024
    • (2024) Prototypical Prompting for Text-to-image Person Re-identification. Proceedings of the 32nd ACM International Conference on Multimedia (2331-2340). DOI: 10.1145/3664647.3681165. Online publication date: 28-Oct-2024
    • (2024) Image-Specific Information Suppression and Implicit Local Alignment for Text-Based Person Search. IEEE Transactions on Neural Networks and Learning Systems 35:12 (17973-17986). DOI: 10.1109/TNNLS.2023.3310118. Online publication date: Dec-2024
    • (2024) Comprehensive Attribute Prediction Learning for Person Search by Language. IEEE Transactions on Image Processing 33 (1990-2003). DOI: 10.1109/TIP.2024.3372832. Online publication date: 2024
    • (2024) VGSG: Vision-Guided Semantic-Group Network for Text-Based Person Search. IEEE Transactions on Image Processing 33 (163-176). DOI: 10.1109/TIP.2023.3337653. Online publication date: 1-Jan-2024
    • (2024) Fine-Granularity Alignment for Text-Based Person Retrieval Via Semantics-Centric Visual Division. IEEE Transactions on Circuits and Systems for Video Technology 34:9 (8242-8252). DOI: 10.1109/TCSVT.2024.3392831. Online publication date: Sep-2024
    • (2024) An Overview of Text-Based Person Search: Recent Advances and Future Directions. IEEE Transactions on Circuits and Systems for Video Technology 34:9 (7803-7819). DOI: 10.1109/TCSVT.2024.3376373. Online publication date: Sep-2024
    • (2024) LAIP: Learning Local Alignment from Image-Phrase Modeling for Text-based Person Search. 2024 IEEE International Conference on Multimedia and Expo (ICME) (1-10). DOI: 10.1109/ICME57554.2024.10688134. Online publication date: 15-Jul-2024
