Abstract
This paper considers the problem of text-based person search, which aims to find a target person given a textual query description. Previous methods commonly focus on learning shared image-text embeddings but largely ignore the effect of pedestrian attributes. Attributes are fine-grained cues that provide mid-level semantics and have been shown to be effective in traditional image-based person search. In text-based person search, however, it is hard to incorporate attribute information when learning discriminative image-text embeddings, because (1) the description of an attribute can vary across texts and (2) it is hard to decouple attribute-related information without attribute annotations. In this paper, we propose a model that improves embedding learning by virtual attribute decoupling (iVAD) to learn modality-invariant image-text embeddings. To the best of our knowledge, this is the first work to perform unsupervised attribute decoupling in the text-based person search task. In iVAD, we first propose a novel virtual attribute decoupling (VAD) module, which uses an encoder-decoder embedding learning structure to decompose attribute information from images and text. In this module, we regard the pedestrian attributes as a hidden vector and obtain attribute-related embeddings. In addition, unlike previous works that separate attribute learning from image-text embedding learning, we propose a hierarchical feature embedding framework: we incorporate the attribute-related embeddings into the learned image-text embeddings through an attribute-enhanced feature embedding (AEFE) module. The proposed AEFE module uses attribute information to improve the discriminability of the learned features. Extensive evaluations demonstrate the superiority of our method over a wide variety of state-of-the-art methods on the CUHK-PEDES dataset.
Experimental results on Caltech-UCSD Birds (CUB), Oxford-102 Flowers (Flowers), and Flickr30K further verify the effectiveness of the proposed approach. A visualization also shows that the proposed iVAD model can effectively discover co-occurring pedestrian attributes in corresponding image-text pairs.
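To make the data flow described in the abstract concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a single linear encoder-decoder with random, untrained weights shared by both modalities, and it assumes that the attribute-related hidden vector is fused with the base embedding by concatenation; the abstract specifies neither choice. The names `LinearAutoencoder`, `fuse`, and `rank_gallery` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


class LinearAutoencoder:
    """Hypothetical stand-in for an encoder-decoder structure:
    feature -> hidden (virtual attribute) vector -> reconstruction."""

    def __init__(self, feat_dim, hidden_dim):
        # Random, untrained weights; a real model would learn these.
        self.w_enc = rng.normal(scale=0.1, size=(feat_dim, hidden_dim))
        self.w_dec = rng.normal(scale=0.1, size=(hidden_dim, feat_dim))

    def encode(self, x):
        # Hidden vector treated as the attribute-related embedding.
        return np.tanh(x @ self.w_enc)

    def decode(self, z):
        # Map the hidden vector back to feature space.
        return z @ self.w_dec


def fuse(base_emb, attr_emb):
    """Assumed fusion: concatenate both embeddings, then normalize."""
    return l2_normalize(np.concatenate([base_emb, attr_emb], axis=-1))


def rank_gallery(query_emb, gallery_embs):
    """Return gallery indices sorted from most to least similar (cosine)."""
    sims = gallery_embs @ query_emb
    return np.argsort(-sims)


# Toy features: 5 gallery images; the text query describes image 2.
image_feats = rng.normal(size=(5, 16))
text_feat = image_feats[2] + 0.01 * rng.normal(size=16)

vad = LinearAutoencoder(feat_dim=16, hidden_dim=4)

# Reconstruction error, the quantity an encoder-decoder would minimize.
recon_error = np.mean((vad.decode(vad.encode(image_feats)) - image_feats) ** 2)

# Attribute-enhanced embeddings for the gallery and the query.
gallery = fuse(l2_normalize(image_feats), vad.encode(image_feats))
query = fuse(l2_normalize(text_feat), vad.encode(text_feat))

ranking = rank_gallery(query, gallery)
print("ranking:", ranking, "reconstruction MSE:", recon_error)
```

Because the weights are random, the sketch illustrates only the shapes and the retrieval step. Any gain in discriminability would come from training, which is outside the scope of this example.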
Acknowledgements
This work is supported by the National Natural Science Foundation of China (No. 61876159, No. 61806172, No. 62076116, No. U1705286) and the China Postdoctoral Science Foundation Grant (No. 2019M652257).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Wang, C., Luo, Z., Lin, Y. et al. Improving embedding learning by virtual attribute decoupling for text-based person search. Neural Comput & Applic 34, 5625–5647 (2022). https://doi.org/10.1007/s00521-021-06734-9
DOI: https://doi.org/10.1007/s00521-021-06734-9