Abstract
Text-based person Re-ID aims to retrieve the image of a target person from an image gallery, given a text description of that person. Because there is a large modality gap between images and text, effectively matching the semantic features of image-text pairs is critical. Existing schemes mainly consider how to extract a more accurate text representation or a more complete image representation, but ignore multi-granularity feature matching. In this paper, we propose a multi-level network guided by pose and text attention for text-based person Re-ID (PAMN). Specifically, we first design a pose-guided image feature extraction model and an attention-driven text semantic representation model to learn multi-granularity features of the image and the text, respectively. We then employ cross-modal projection matching to align these features at each granularity, so as to obtain high matching accuracy. Experiments are performed on the standard CUHK-PEDES dataset and the more recently proposed ICFG-PEDES dataset, and the results show that PAMN outperforms other existing methods.
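The alignment step named in the abstract can be illustrated with a minimal sketch of a cross-modal projection matching style loss applied at several granularities. This is not the authors' implementation: the function names, the level names such as "global", the unweighted sum over levels, and the small `eps` guard are all assumptions made for illustration. For each query feature, the sketch projects it onto the L2-normalized features of the other modality, turns the projections into a softmax distribution, and penalizes the KL divergence from the true match distribution given by identity labels.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def projection_matching_loss(queries, targets, query_ids, target_ids, eps=1e-8):
    """One direction of a projection-matching loss: mean KL(p || q) over queries.

    p is the softmax over scalar projections of a query onto each normalized
    target; q is the normalized indicator of which targets share the query's
    identity. eps is an assumed guard against log(0).
    """
    targets_n = [l2_normalize(t) for t in targets]
    loss = 0.0
    for q_vec, q_id in zip(queries, query_ids):
        scores = [sum(a * b for a, b in zip(q_vec, t)) for t in targets_n]
        p = softmax(scores)
        matches = [1.0 if t_id == q_id else 0.0 for t_id in target_ids]
        n_match = sum(matches)
        q = [m / n_match if n_match else 0.0 for m in matches]
        loss += sum(
            pi * (math.log(pi + eps) - math.log(qi + eps))
            for pi, qi in zip(p, q)
        )
    return loss / len(queries)


def multi_granularity_loss(image_feats, text_feats, ids):
    """Sum the image-to-text and text-to-image losses over every granularity.

    image_feats and text_feats map a (hypothetical) level name to a list of
    feature vectors; ids holds the identity label of each image-text pair.
    """
    total = 0.0
    for level in image_feats:
        total += projection_matching_loss(image_feats[level], text_feats[level], ids, ids)
        total += projection_matching_loss(text_feats[level], image_feats[level], ids, ids)
    return total


if __name__ == "__main__":
    images = {"global": [[5.0, 0.0], [0.0, 5.0]]}
    aligned_text = {"global": [[5.0, 0.0], [0.0, 5.0]]}
    swapped_text = {"global": [[0.0, 5.0], [5.0, 0.0]]}
    print(multi_granularity_loss(images, aligned_text, [0, 1]))
    print(multi_granularity_loss(images, swapped_text, [0, 1]))
```

In the toy batch above, the text features that point in the same direction as their paired image features yield a smaller loss than the swapped ones, which is the behavior the matching objective is meant to reward.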
This work is supported by National Natural Science Foundation of China (Nos. 62266009, 61866004, 62276073, 61966004, 61962007), Guangxi Natural Science Foundation (Nos. 2018GXNSFDA281009, 2019GXNSFDA245018, 2018GXNSFDA294001), Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing, and Guangxi “Bagui Scholar” Teams for Innovation and Research Project.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, X., Zhang, C., Li, Z., Wang, Z. (2023). Multi-level Network Based on Text Attention and Pose-Guided for Person Re-ID. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Communications in Computer and Information Science, vol 1794. Springer, Singapore. https://doi.org/10.1007/978-981-99-1648-1_9
DOI: https://doi.org/10.1007/978-981-99-1648-1_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-1647-4
Online ISBN: 978-981-99-1648-1
eBook Packages: Computer Science, Computer Science (R0)