
Relation-aware aggregation network with auxiliary guidance for text-based person search


Abstract

In this paper, we propose a novel Relation-aware Aggregation Network with Auxiliary Guidance (RANAG) for text-based person search. Existing works still struggle to capture the detailed appearance of a person and to compute the similarity between images and texts. RANAG addresses this problem from two aspects: relation-aware visual features and additional auxiliary signals. Specifically, we introduce a Relation-aware Aggregation Network (RAN) that exploits the relations between a person and local objects. We then propose three auxiliary tasks to acquire additional knowledge of semantic representations, each with its own objective: identifying the gender of the pedestrian in the image, distinguishing images of visually similar pedestrians, and aligning the semantic information between the description and the image. In addition, we explore several data augmentation methods that further improve performance. Extensive experiments demonstrate that our model outperforms state-of-the-art methods on the CUHK-PEDES dataset.
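
To make the auxiliary guidance concrete, the following is a minimal PyTorch sketch of how the three auxiliary objectives summarized above could be attached to shared image and text embeddings. It is an illustration under stated assumptions, not the paper's implementation: the module name AuxiliaryGuidance, the embedding dimension, the identity count, the contrastive temperature, and the equal loss weighting are all hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryGuidance(nn.Module):
    """Hypothetical heads for the three auxiliary tasks named in the abstract."""

    def __init__(self, dim=512, num_ids=11003):  # dim and num_ids are assumed values
        super().__init__()
        self.gender_head = nn.Linear(dim, 2)    # task 1: gender of the pedestrian
        self.id_head = nn.Linear(dim, num_ids)  # task 2: distinguish similar pedestrians
        self.temperature = 0.07                 # assumed contrastive temperature

    def forward(self, img_emb, txt_emb, gender, pid):
        # Task 1: identify the pedestrian's gender from the image embedding.
        loss_gender = F.cross_entropy(self.gender_head(img_emb), gender)
        # Task 2: identity classification keeps visually similar pedestrians separable.
        loss_id = F.cross_entropy(self.id_head(img_emb), pid)
        # Task 3: align image and description embeddings of the same person
        # with a symmetric InfoNCE-style contrastive loss over the batch.
        img = F.normalize(img_emb, dim=-1)
        txt = F.normalize(txt_emb, dim=-1)
        logits = img @ txt.t() / self.temperature
        targets = torch.arange(img.size(0), device=img.device)
        loss_align = 0.5 * (F.cross_entropy(logits, targets)
                            + F.cross_entropy(logits.t(), targets))
        # Equal weighting is an assumption; the paper may balance the tasks differently.
        return loss_gender + loss_id + loss_align

In training, such an auxiliary loss would be added to the main retrieval loss; the alignment term assumes each image in the batch is paired with its own description.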



Author information

Corresponding author

Correspondence to Shuaiqi Jing.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Synthetic Media on the Web

Guest Editors: Huimin Lu, Xing Xu, Jože Guna, and Gautam Srivastava


Cite this article

Zeng, P., Jing, S., Song, J. et al. Relation-aware aggregation network with auxiliary guidance for text-based person search. World Wide Web 25, 1565–1582 (2022). https://doi.org/10.1007/s11280-021-00953-9

