Abstract
In this paper, we propose a novel Relation-aware Aggregation Network with Auxiliary Guidance for text-based person search, namely RANAG. Existing works still struggle to capture the detailed appearance of a person and to compute the similarity between images and texts. RANAG is designed to address this problem from two aspects: relation-aware visual features and additional auxiliary signals. Specifically, we introduce a Relation-aware Aggregation Network (RAN) that exploits the relations between the person and local objects. We then propose three auxiliary tasks to acquire additional knowledge for the semantic representations, each with its own objective: identifying the gender of the pedestrian in the image, distinguishing images of similar pedestrians, and aligning the semantic information between the description and the image. In addition, the data augmentation methods we explore can further improve performance. Extensive experiments demonstrate that our model outperforms state-of-the-art methods on the CUHK-PEDES dataset.
This article belongs to the Topical Collection: Special Issue on Synthetic Media on the Web
Guest Editors: Huimin Lu, Xing Xu, Jože Guna, and Gautam Srivastava
Cite this article
Zeng, P., Jing, S., Song, J. et al. Relation-aware aggregation network with auxiliary guidance for text-based person search. World Wide Web 25, 1565–1582 (2022). https://doi.org/10.1007/s11280-021-00953-9