
1 Introduction

Recently, we have witnessed numerous practical breakthroughs in person modeling related tasks, e.g., pedestrian detection  [2, 5, 47], pedestrian attribute recognition  [30, 40] and person re-identification  [13, 26, 59]. Person search  [15, 25], as an aggregation of the aforementioned tasks, has thus gained increasing research attention. Compared with searching by image queries, person search by natural language  [6, 24, 25, 52] makes the retrieval procedure more user-friendly and more flexible because it supports open-form natural language queries. Meanwhile, learning robust visual-textual associations becomes increasingly critical, which creates an urgent demand for a representation learning schema that can fully exploit both modalities.

Fig. 1.

In a case where two persons exhibit similar appearance attributes, it is hard to discriminate them merely by full-body appearance. Instead of matching the textual descriptions with the images at the global level, we decompose both image and text into attribute components and conduct a fine-grained matching strategy.

Relevant studies in person modeling point out the critical role of discriminative representations, especially of the local fragments in both image and text. For the former,  [38, 58] learn pose-related features from human keypoint maps, while  [20, 27] leverage body-part features through auxiliary segmentation-based supervision. For the latter,  [24, 25, 55] decompose complex sentences into noun phrases, and  [23, 52] directly adopt attribute-specific annotations to learn fine-grained attribute-related features. Attribute-specific features from image and text are thus all but requisite for the person search by natural language task, and how to effectively couple them remains an open question. We seek insight from a fundamental flaw that lingers in most current visual-language systems, illustrated in Fig. 1 and termed “malpositioned matching”. Tasks like textual grounding  [33, 35], VQA  [1], and image-text retrieval  [18, 36] measure similarities or mutual information across modalities in a holistic fashion by answering: do the feature vectors of image and text match each other? That way, when users input “a girl in white shirt and black skirt” as the retrieval query, the model is not able to distinguish the nuances of the two images shown in Fig. 1, where the false positive actually shows “black shirt and white skirt”. As both distinct color cues (“white” and “black”) exist in the images, overall matching without the ability to refer them to specific appearance attributes prevents the model from discriminating the two as needed. Such cases exist extensively in almost all cross-modal tasks and pose a challenge that a system can only tackle with the ability of fine-grained interplay between image and text.

Here, we put forward a novel Visual-Textual Attributes Alignment model (ViTAA). For feature extraction, we fully exploit both visual and textual attribute representations. Specifically, we leverage segmentation labels to drive attribute-aware feature learning from the input image. As shown in Fig. 3, we design multiple local branches, each of which is responsible for predicting the visual feature of one particular attribute. This process is guided by supervision from segmentation annotations, so the features are intrinsically aligned through the label information. We then use a generic natural language parser to extract attribute-related phrases, which at the same time removes the complex syntax of natural language and redundant, non-informative descriptions. Building upon this, we adopt a contrastive learning schema to learn a joint embedding space for both visual and textual attributes. Meanwhile, we also notice that common attributes may exist across different person identities (e.g., two different persons may wear a similar “black shirt”). To thoroughly exploit these cases during training, we propose a novel sampling method to mine surrogate positive examples, which largely enriches the sampling space and also provides valid, informative samples that help overcome the convergence problem in metric learning.

To this end, we argue and show that the benefits of attribute alignment to a person search model go well beyond the obvious. As the images used for person search tasks often exhibit large variance in appearance (e.g., varying poses or viewpoints, with/without occlusion, and cluttered backgrounds), the abstracted attribute-specific features naturally help to resolve ambiguity in the feature representations. Also, searching by appearance attributes innately brings interpretability to the retrieval task and enables attribute-specific retrieval. It is worth mentioning that a few very recent efforts attempt to utilize the local fragments in both visual and textual modalities  [8, 49] and hierarchically align them  [3, 6]. The pairing schema of visual features and textual phrases in these methods is based solely on the same identity, neglecting cues that exist across different identities. Compared with them, ours is a more comprehensive modeling method that fully exploits identical attributes from different persons and thus greatly helps the alignment learning.

To validate these claims, we conduct extensive experiments with our ViTAA model on the tasks of 1) person search by natural language and 2) person search by attribute, showing that the proposed model is capable of linking specific visual cues with specific words/phrases. More concretely, ViTAA achieves promising results across all these tasks. Further qualitative analysis verifies that our alignment learning successfully captures the fine-grained correspondence between visual and textual attributes. To summarize our contributions:

  • We design an attribute-aware representation learning framework to extract and align both visual and textual features for the task of person search by natural language. To the best of our knowledge, this is the first work to adopt both semantic segmentation and natural language parsing to facilitate semantically aligned representation learning.

  • We design a novel cross-modal alignment learning schema based on contrastive learning, which adaptively highlights informative samples during alignment learning. Meanwhile, an unsupervised data sampling method is proposed to facilitate the construction of learning pairs by exploiting surrogate positive samples across different person identities.

  • We demonstrate the superiority of ViTAA over other state-of-the-art methods on the person search by natural language task, and provide qualitative analyses showing the interpretability of ViTAA.

2 Related Work

Person Search. Depending on the form of the query, current person search tasks fall into two major thrusts: searching by images (termed person re-identification) and person search by textual descriptions. Typical person re-identification (re-id) methods  [13, 26, 59] are formulated as retrieving the candidate in the image gallery that shows the highest correlation with the query. However, a clear and valid image query is not always available in real scenarios, which largely impedes the application of re-id. Recently, researchers have turned their attention to re-id by textual descriptions: identifying the target person using free-form natural language  [3, 24, 25]. This setting comes with great challenges, as it requires the model to handle the complex syntax of long, free-form descriptive sentences and the inconsistent interpretations of low-quality surveillance images. To tackle these, methods such as  [4, 24, 25] employ attention mechanisms to build relation modules between visual and textual representations, while  [55, 60] propose cross-modal objective functions for joint embedding learning. Dense visual features are extracted in  [32] by cropping the input image to learn a region-level matching schema. Beyond this,  [19] introduces pose estimation information for fine-grained human body-part parsing.

Attribute Representations. Adopting appropriate feature representations is of crucial importance for learning and retrieving from both image and text. Previous efforts in person search by natural language unanimously use holistic features of the person, which omit the partial visual cues of attributes at the fine-grained level. Multiple re-id systems have focused on processing body-part regions for visual feature learning, which can be summarized as: hand-crafted horizontal stripes or grids  [26, 43, 46], attention mechanisms  [37, 45], and auxiliary information including keypoints  [41, 50], human parsing masks  [20, 27], and dense semantic estimation  [56]. Among these methods, the auxiliary information usually yields more accurate partitions when localizing human parts and facilitates body-part attribute representations, thanks to multi-task training or auxiliary networks. However, only a few works  [14] pay attention to accessories (such as backpacks), which could be potential contextual cues for accurate person retrieval. As the counterparts of specific visual cues, textual attribute phrases are usually provided as ground-truth labels or can be extracted from sentences by identifying noun phrases through sentence parsing. Many works use textual attributes as auxiliary label information to complement the content of image features  [23, 29, 39]. Recently, a few attempts leverage textual attributes as queries for person retrieval  [6, 52]:  [52] imposes an attribute-guided attention mechanism to capture the holistic appearance of a person, and  [6] proposes a hierarchical matching model that jointly learns global category-level and local attribute-level embeddings.

Visual-Semantic Embedding. Works in vision and language promote the notion of visual-semantic embedding, with the goal of learning a joint feature space for visual inputs and their corresponding textual annotations  [10, 53]. Such a mechanism plays a core role in a series of cross-modal tasks, e.g., image/video captioning  [7, 21, 51], image retrieval through natural language  [8, 49, 55], and visual question answering  [1]. The conventional joint embedding learning framework adopts a two-branch architecture  [8,9,10, 53, 55], where one branch extracts image features and the other encodes textual descriptions. The extracted cross-modal embedding features are learned through carefully designed objective functions.

3 Our Approach

Our network is composed of an image stream and a language stream (see Fig. 3), with the intention of encoding inputs from both modalities for visual-textual embedding learning. To be specific, given a person image \(\mathcal {I}\) and its textual description \(\mathcal {T}\), we first use the image stream to extract a global visual representation \(\textit{\textbf{v}}_0\) and a stack of local visual representations of \(N_{att}\) attributes \(\{\textit{\textbf{v}}_1, ... ,\textit{\textbf{v}}_{N_{att}}\}\), \(\textit{\textbf{v}}_i\in \mathbb {R}^{d}\). Similarly, we follow the language stream to extract the overall textual embedding \(\textit{\textbf{t}}_0\), then decompose the whole sentence into a list of attribute phrases using a standard natural language parser  [22] and encode them as \(\{\textit{\textbf{t}}_1, ... ,\textit{\textbf{t}}_{N_{att}}\}\), \(\textit{\textbf{t}}_i\in \mathbb {R}^d\). Our core contribution is the cross-modal alignment learning that matches each visual component \(\textit{\textbf{v}}_a\) with its corresponding textual phrase \(\textit{\textbf{t}}_a\), along with the global representation matching \(\big <\textit{\textbf{v}}_0, \textit{\textbf{t}}_0\big>\), for the person search by natural language task.
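As a concrete illustration, the PyTorch sketch below shows how a retrieval score could combine the global and attribute-level features produced by the two streams. The combination rule (an unweighted sum of the global cosine similarity and the mean attribute cosine similarity) is an assumption for illustration only, not necessarily the exact scoring used in our experiments.

```python
import torch.nn.functional as F

def match_score(v0, v_attr, t0, t_attr):
    """Score one image/description pair from the two-stream outputs.
    v0, t0: (d,) global features; v_attr, t_attr: (N_att, d) attribute features.
    The unweighted sum below is an illustrative combination rule."""
    s_glb = F.cosine_similarity(v0, t0, dim=-1)                  # global match
    s_att = F.cosine_similarity(v_attr, t_attr, dim=-1).mean()   # mean attribute match
    return s_glb + s_att
```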

3.1 The Image Stream

We adopt the sub-network of ResNet-50 (conv1, conv2_x, conv3_x, conv4_x)  [17] as the backbone to extract feature maps \(\textit{\textbf{F}}\) from the input image. Then, we introduce a global branch \(\mathcal {F}^{glb}\) and multiple local branches \(\mathcal {F}^{loc}_{a}\) to generate the global visual feature \(\textit{\textbf{v}}_{0} = \mathcal {F}^{glb}(\textit{\textbf{F}})\) and the attribute visual features \(\{\textit{\textbf{v}}_{1}\dots \textit{\textbf{v}}_{N_{att}}\}\), respectively, where \(\textit{\textbf{v}}_{a} = \mathcal {F}^{loc}_{a}(\textit{\textbf{F}})\). The network architectures are shown in Table 1. On top of all the local branches is an auxiliary segmentation layer that supervises each local branch to generate the segmentation map of one specific attribute category (shown in Fig. 3). Intuitively, we argue that this auxiliary task acts as a knowledge regulator that diversifies the local branches to produce attribute-specific features.

Our segmentation layer adopts the architecture of a lightweight MaskHead  [16] and can be removed during the inference phase to reduce computational cost. The remaining problem is that human parsing annotations are not readily available in person search datasets. To address this, we first train a human parsing network with HRNet  [42] as an off-the-shelf tool, where the HRNet is jointly trained on multiple human parsing datasets: MHPv2  [57], ATR  [28], and VIPeR  [44]. We then use its attribute category predictions as our segmentation annotations (illustrated in Fig. 2). With these annotations, the local branches receive the supervision needed from the segmentation task to learn attribute-specific features. Essentially, we distill the attribute information from a well-trained human parsing network to the lightweight segmentation layer through joint training.
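To make the image stream concrete, below is a minimal PyTorch sketch under several assumptions: the class name ImageStream, the placeholder conv blocks standing in for the Table 1 BasicBlock branches, the 256-d embedding, and the binary (attribute vs. background) segmentation heads are all illustrative choices, not the released implementation.

```python
import torch
import torch.nn as nn
import torchvision

class ImageStream(nn.Module):
    """Sketch of the image stream: a shared ResNet-50 backbone (conv1-conv4_x),
    one global branch, N_att local branches, and a lightweight per-attribute
    segmentation head used only during training."""

    def __init__(self, n_att=5, dim=256, n_seg_classes=2):
        super().__init__()
        resnet = torchvision.models.resnet50()
        self.backbone = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3)   # 1024-channel maps
        self.global_branch = self._branch(1024, dim)
        self.local_branches = nn.ModuleList(
            [self._branch(1024, dim) for _ in range(n_att)])
        # per-attribute segmentation heads: attribute vs. background (assumption)
        self.seg_heads = nn.ModuleList(
            [nn.Conv2d(dim, n_seg_classes, kernel_size=1) for _ in range(n_att)])
        self.pool = nn.AdaptiveAvgPool2d(1)

    @staticmethod
    def _branch(in_ch, dim):
        # placeholder conv block standing in for the Table 1 BasicBlocks
        return nn.Sequential(
            nn.Conv2d(in_ch, dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(dim), nn.ReLU(inplace=True))

    def forward(self, x, with_seg=False):
        f = self.backbone(x)                                    # (B, 1024, H, W)
        v0 = self.pool(self.global_branch(f)).flatten(1)        # (B, dim)
        maps = [branch(f) for branch in self.local_branches]    # attribute maps
        v_attr = torch.stack([self.pool(m).flatten(1) for m in maps], dim=1)
        if with_seg:                                            # training only
            seg = [head(m) for head, m in zip(self.seg_heads, maps)]
            return v0, v_attr, seg
        return v0, v_attr                                       # inference path
```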

Fig. 2.

Attribute annotations generated by the human parsing network. Torsos are labeled as background since there are no corresponding textual descriptions.

Discussion. Using attribute features has the following advantages over global features. 1) The textual annotations in the person search by natural language task describe a person mostly by dressing/body appearance, which attribute features fit perfectly. 2) Attribute alignment avoids the “malpositioned matching” cases shown in Fig. 1: using segmentation to regularize feature learning makes the model resilient to diverse human poses and viewpoints, and robust to background noise.

3.2 The Language Stream

Given the raw textual description, our language stream first parses and extracts noun phrases w.r.t. each attribute through the Stanford POS tagger  [31], and then feeds them into a language network to obtain sentence-level as well as phrase-level embeddings. We adopt a bi-directional LSTM to generate the global textual embedding \(\textit{\textbf{t}}_0\) and the local textual embeddings. Meanwhile, we adopt a dictionary clustering approach, as in  [8], to assign the parsed noun phrases to specific attribute categories. Concretely, we manually collect a list of words per attribute category, e.g., “shirt”, “jersey”, “polo” for the upper-body category, and use their average-pooled word vectors  [12] as the anchor embedding \(\mathbf {d}_a\), forming the dictionary \(\mathbf {D} = [\mathbf {d}_1, ... ,\mathbf {d}_{N_{att}}]\), where \(N_{att}\) is the total number of attributes. Building upon that, we assign each noun phrase to the category with the highest cosine similarity and form the local textual embeddings \(\{\textit{\textbf{t}}_1\dots \textit{\textbf{t}}_{N_{att}}\}\). Different from previous works such as  [32, 56], we also include accessory as one type of attribute, which serves as a crucial matching clue in many cases.
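The dictionary clustering step can be sketched as follows. The word lists, the word_vec lookup (e.g., a pre-trained GloVe/word2vec table), and the function names are illustrative assumptions; only the mechanism (average-pooled anchor embeddings and a highest-cosine-similarity assignment) follows the description above.

```python
import numpy as np

# Illustrative, partial word lists per attribute category; the actual lists
# used in our experiments are not reproduced here.
CATEGORY_WORDS = {
    "head":  ["hat", "glasses", "hair"],
    "upper": ["shirt", "jersey", "polo", "jacket"],
    "lower": ["pants", "skirt", "jeans", "shorts"],
    "shoes": ["shoes", "sneakers", "boots"],
    "bags":  ["backpack", "handbag", "bag"],
}

def build_dictionary(word_vec):
    """Anchor embedding d_a = average of the word vectors of each category."""
    cats = list(CATEGORY_WORDS)
    D = np.stack([np.mean([word_vec[w] for w in CATEGORY_WORDS[c]], axis=0)
                  for c in cats])
    D /= np.linalg.norm(D, axis=1, keepdims=True)       # unit-norm anchors
    return cats, D

def assign_phrase(phrase_vec, cats, D):
    """Assign a pooled noun-phrase vector to the most similar category."""
    p = phrase_vec / (np.linalg.norm(phrase_vec) + 1e-12)
    return cats[int(np.argmax(D @ p))]                  # highest cosine similarity wins
```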

Fig. 3.

Illustrative diagram of our ViTAA network, which includes an image stream (left) and a language stream (right). The image stream encodes the person image and extracts both global and attribute representations. The local branches are additionally supervised by an auxiliary segmentation layer whose annotations are acquired from an off-the-shelf human parsing network. Meanwhile, the textual description is parsed and decomposed into attribute atoms, which are encoded by a weight-sharing Bi-LSTM. We train ViTAA jointly under the global/attribute alignment losses in an end-to-end manner.

3.3 Visual-Textual Alignment Learning

Once we have extracted the global and attribute features, the key objective of the next stage is to learn a joint embedding space across the visual and textual modalities, in which visual cues are tightly matched with the given textual description. Mathematically, we formulate our learning objective as a contrastive learning task that takes triplets as input, i.e., \(\big <\textit{\textbf{v}}^i, \textit{\textbf{t}}^+, \textit{\textbf{t}}^-\big>\) and \(\big <\textit{\textbf{t}}^i, \textit{\textbf{v}}^+, \textit{\textbf{v}}^-\big>\), where i denotes the index of the person to identify, and \(+/-\) refer to the feature representations of person i and of a randomly sampled irrelevant person, respectively. We note that the features in a triplet can be either at the global level or at the attribute level. In the following, we discuss the learning schema on \(\big <\textit{\textbf{v}}^i, \textit{\textbf{t}}^+, \textit{\textbf{t}}^-\big>\), which extends directly to \(\big <\textit{\textbf{t}}^i, \textit{\textbf{v}}^+, \textit{\textbf{v}}^-\big>\).

We adopt the cosine similarity as the scoring function between visual and textual features, \(S= \frac{\textit{\textbf{v}}^T\cdot \textit{\textbf{t}}}{||\textit{\textbf{v}}||\cdot ||\textit{\textbf{t}}||}\). For a positive pair \(\big <\textit{\textbf{v}}^i, \textit{\textbf{t}}^+\big>\), the cosine similarity \(S^+\) is encouraged to be as large as possible, which we define as the absolute similarity criterion. For a negative pair \(\big <\textit{\textbf{v}}^i, \textit{\textbf{t}}^-\big>\), however, enforcing the cosine similarity \(S^-\) to be minimal would impose an arbitrary constraint on the negative samples \(\textit{\textbf{t}}^-\). Instead, we propose to enforce the deviation between \(S^+\) and \(S^-\) to be larger than a preset margin, called the relative similarity criterion. These criteria can be formulated as:

$$\begin{aligned} S^+\rightarrow 1 \quad \text {and}\quad S^+ - S^- > m, \end{aligned}$$
(1)

where m is the minimum margin by which the positive and negative similarities should differ, set to 0.2 in practice.

In contrastive learning, the basic objective function is generally either the hinge loss \(\mathcal {L}(\textit{\textbf{x}})=\max \{0, 1-\textit{\textbf{x}}\}\) or the logistic loss \(\mathcal {L}(\textit{\textbf{x}})=\log (1+\exp {(-\textit{\textbf{x}})})\). One crucial drawback of the hinge loss is that its derivative w.r.t. x is a constant whenever the loss is active: \(\frac{\partial \mathcal {L}}{\partial x}=-1\). Since the pair-based construction of training data leads to a polynomial growth in the number of training pairs, a certain portion of the randomly sampled negative texts will inevitably be less informative during training. Treating all these redundant samples equally raises the risk of slow convergence or even model degeneration in metric learning. In contrast, the derivative of the logistic loss w.r.t. x is \(\frac{\partial \mathcal {L}}{\partial x}=-\frac{1}{e^x+1}\), which depends on the input value. Hence, we settle on the logistic loss as our basic objective function.

With the logistic loss, the aforementioned criteria can be rewritten as:

$$\begin{aligned} (S^+-\alpha )> 0,\ -( S^- - \beta ) > 0, \end{aligned}$$
(2)

where \(\alpha \rightarrow 1\) denotes the lower bound for the positive similarity and \(\beta = (\alpha - m)\) denotes the upper bound for the negative similarity. Together with the logistic loss, our final alignment loss can be unrolled as:

$$\begin{aligned} \mathcal {L}_{align} = \frac{1}{N}\sum _{i=1}^{N}\Big \{ \log \left[ 1 + e^{-\tau _p(S_{i}^+ - \alpha )}\right] + \log \left[ 1 + e^{\tau _n(S_i^{-} - \beta )}\right] \Big \}, \end{aligned}$$
(3)

where \(\tau _p\) and \(\tau _n\) denote the temperature parameters that adjust the slope of the gradient. The partial derivatives are calculated as:

$$\begin{aligned} \frac{\partial \mathcal {L}_{align}}{\partial S_i^{+}} = \frac{-\tau _p}{1 + e^{\tau _p (S_i^{+} - \alpha )}}, \frac{\partial \mathcal {L}_{align}}{\partial S_i^{-}} = \frac{\tau _n}{1+e^{\tau _n (\beta - S_i^{-})}}. \end{aligned}$$
(4)

Thus, Eq. 3 produces continuous gradients and accordingly assigns higher weights to the more informative samples.
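For reference, a minimal PyTorch sketch of Eq. 3 for the \(\big <\textit{\textbf{v}}^i, \textit{\textbf{t}}^+, \textit{\textbf{t}}^-\big>\) direction is given below, using the hyperparameter values reported in Sec. 4.1; the batching of positives and negatives is simplified to one pair of each per anchor.

```python
import torch
import torch.nn.functional as F

def alignment_loss(v, t_pos, t_neg, alpha=0.6, beta=0.4, tau_p=10.0, tau_n=40.0):
    """Eq. 3 for the <v, t+, t-> direction; all inputs are (N, d) tensors.
    softplus(x) = log(1 + exp(x)) gives the two logistic terms."""
    s_pos = F.cosine_similarity(v, t_pos, dim=1)      # S+
    s_neg = F.cosine_similarity(v, t_neg, dim=1)      # S-
    loss = F.softplus(-tau_p * (s_pos - alpha)) + F.softplus(tau_n * (s_neg - beta))
    return loss.mean()
```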

K-Reciprocal Sampling. One premise of visual-textual alignment is to fully exploit informative positive and negative samples \(\textit{\textbf{t}}^+, \textit{\textbf{t}}^-\) to provide valid supervision. However, most current contrastive learning methods  [48, 54] construct positive pairs by selecting samples belonging to the same class and simply treat random samples from other classes as negatives. This is viable when only coarse, global information is used during training, but it may not handle cases like the one illustrated in Fig. 1, where a fine-grained comparison is needed, and it depends heavily on the average number of samples per attribute category to provide comprehensive positive samples. With this insight, we propose to enlarge the search space of positive samples with cross-identity instances.

For instance, in Fig. 1, although the two ladies have different identities, they wear nearly identical shoes, which can be treated as positive samples for learning. We term such samples, with identical attributes but belonging to different person identities, the “surrogate positive samples”. Including the common attribute features of surrogate positive samples in positive pairs makes far more sense than treating them as negatives. It is worth noting that this is unique to our attribute alignment learning phase, because attributes can only be compared at the fine-grained level. The key question then is: how can we dig out surrogate positive samples when we have no direct cross-identity attribute annotations? Inspired by the re-ranking techniques in the re-id community  [11, 61], we propose k-reciprocal sampling as an unsupervised method to generate surrogate labels at the attribute level. For each attribute a, we extract a batch of visual and textual features from the feature learning network and mine their corresponding surrogate positive samples with our sampling algorithm; since we only discuss the input form \(\big <\textit{\textbf{v}}^i, \textit{\textbf{t}}^+, \textit{\textbf{t}}^-\big>\), the algorithm in effect mines the surrogate positive textual features for each \(\textit{\textbf{v}}^i\). Note that if the attribute information in either modality is missing after parsing, we simply ignore it during sampling.
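One plausible reading of this sampling procedure is sketched below (the exact algorithm listing is omitted here, so details such as tie handling and batch composition are assumptions): for a given attribute, a text is kept as a surrogate positive of an image if each lies within the other's k nearest neighbors in the batch.

```python
import torch
import torch.nn.functional as F

def k_reciprocal_positives(v, t, k=8):
    """For one attribute category: v (Nv, d) visual and t (Nt, d) textual
    features from a batch. t_j is kept as a surrogate positive for v_i when
    j is among the k nearest texts of v_i AND i is among the k nearest
    images of t_j (k-reciprocal). Returns a (Nv, Nt) boolean mask."""
    v = F.normalize(v, dim=1)
    t = F.normalize(t, dim=1)
    sim = v @ t.t()                                              # cosine similarities
    # double argsort turns similarities into ranks (0 = most similar)
    rank_vt = sim.argsort(dim=1, descending=True).argsort(dim=1)
    rank_tv = sim.t().argsort(dim=1, descending=True).argsort(dim=1)
    return (rank_vt < k) & (rank_tv < k).t()                     # reciprocal agreement
```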


3.4 Joint Training

The entire network is trained in an end-to-end manner. We adopt the widely used cross-entropy loss (ID loss) to assist the learning of discriminative features for each instance, as well as a pixel-level cross-entropy loss (Seg loss) to classify the attribute categories in the auxiliary segmentation task. For cross-modal alignment learning, we apply the alignment loss to both the global-level and the attribute-level representations. The overall loss function is:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{id} + \mathcal {L}_{seg} + \mathcal {L}_{align}^{glo} + \mathcal {L}_{align}^{attr}. \end{aligned}$$
(5)
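A minimal sketch of how the four terms of Eq. 5 could be combined during training is shown below; aggregating the per-attribute segmentation maps by summing their pixel-level cross-entropy terms is an assumption.

```python
import torch.nn.functional as F

def total_loss(id_logits, id_labels, seg_logits, seg_labels, glo_align, attr_align):
    """Eq. 5: unweighted sum of ID, segmentation, and the two alignment terms.
    glo_align/attr_align are outputs of alignment_loss() above; seg_logits and
    seg_labels are per-attribute lists of predicted maps and target masks."""
    l_id = F.cross_entropy(id_logits, id_labels)
    l_seg = sum(F.cross_entropy(s, y) for s, y in zip(seg_logits, seg_labels))
    return l_id + l_seg + glo_align + attr_align
```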
Table 1. Detailed architecture of the global and local branches in the image stream. #Branch denotes the number of sub-branches.

4 Experiment

4.1 Experimental Setting

Datasets. We conduct experiments on the CUHK-PEDES  [25] dataset, which is currently the only benchmark for person search by natural language. It contains 40,206 images of 13,003 different persons, where each image comes with two human-annotated sentences. The dataset is split into 11,003 identities with 34,054 images in the training set, 1,000 identities with 3,078 images in the validation set, and 1,000 identities with 3,074 images in the test set.

Evaluation Protocols. Following the standard evaluation setting, we adopt Recall@K (K = 1, 5, 10) as the retrieval criterion. Specifically, given a text description as query, Recall@K (R@K) reports the percentage of queries for which at least one image of the corresponding person is retrieved among the top-K results.
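A short sketch of this protocol, assuming a precomputed query-gallery similarity matrix and integer person IDs:

```python
import numpy as np

def recall_at_k(sim, query_ids, gallery_ids, ks=(1, 5, 10)):
    """sim: (Q, G) similarity matrix between text queries and gallery images.
    A query counts as a hit at K if any of its top-K images shares its ID."""
    order = np.argsort(-sim, axis=1)                   # rank gallery by similarity
    hits = gallery_ids[order] == query_ids[:, None]    # (Q, G) boolean matches
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
```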

Implementation Details. For the global and local branches in the image stream, we use the BasicBlock described in  [17], with each branch randomly initialized (the detailed architecture is shown in Table 1). We use horizontal flipping for data augmentation and resize all images to \(384\times 128\). We use the Adam optimizer with weight decay \(4\times 10^{-5}\) and 64 image-language pairs per mini-batch. The learning rate is initialized to \(2\times 10^{-4}\) for the first 40 epochs of training, then decayed by a factor of 0.1 for the remaining 30 epochs. All experiments are conducted on a single Tesla V100 GPU. The hyperparameters in Eq. 3 are empirically set as: \(\alpha =0.6,\beta =0.4,\tau _p=10,\tau _n=40\).
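The optimizer and schedule described above can be reproduced with a few lines; the parameter grouping (a single group for all parameters) is an assumption.

```python
import torch

def build_optimizer(model):
    """Adam with weight decay 4e-5; lr 2e-4 for 40 epochs, then decayed by 0.1
    for the remaining 30 epochs, matching the settings stated above."""
    opt = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=4e-5)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[40], gamma=0.1)
    return opt, sched
```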

Pedestrian Attribute Parsing. Based on an analysis of the image and natural language annotations in the dataset, we group both visual and textual attributes into 5 categories: head (including descriptions related to hats, glasses, and face), clothes on the upper body, clothes on the lower body, shoes, and bags (including backpacks and handbags). We consider these attributes visually distinguishable in both modalities. In Fig. 2, we visualize the segmentation maps generated by our human parsing network, where attribute regions are properly segmented and associated with the correct labels.

4.2 Comparisons with the State-of-The-Arts

Results on the CUHK-PEDES Dataset. We summarize the performance of ViTAA and compare it with state-of-the-art methods on the CUHK-PEDES test set in Table 2. Methods such as GNA-RNN  [25], CMCE  [24], and PWM-ATH  [4] employ attention mechanisms to learn the relation between visual and textual representations, while Dual Path  [60] and CMPM+CMPC  [55] design objective functions for better joint embedding learning. These methods learn and utilize only the “global” feature representations of image and text. Moreover, MIA  [32] exploits “region” information by dividing the input image into several horizontal stripes and extracting noun phrases from the natural language description. Similarly, GALM  [19] leverages “keypoint” information from human pose estimation as an attention mechanism to assist feature learning, together with a noun phrase extractor applied to the input text. Although these two utilize local-level representations, neither learns the associations between visual features and textual phrases. From Table 2, we observe that ViTAA shows a consistent lead on all metrics (R@1-10), outperforming GALM  [19] by margins of 1.85%, 0.39%, and 0.55%, and setting new state-of-the-art results. We note that although the margins may appear incremental, the improvement on R@1 is particularly challenging to obtain, which suggests that the alignment learning of ViTAA contributes directly to the retrieval task. We further report ablation studies on the effect of different components, and present the attribute retrieval results quantitatively and qualitatively.

Table 2. Person search results on the CUHK-PEDES test set. Best results are in bold.

4.3 Ablation Study

We carry out comprehensive ablations to evaluate the contribution of different components and the training configurations.

Comparisons over Different Component Combinations. To compare the individual contribution of each component, we set the baseline as the model trained with only the ID loss. In Table 3, we report the improvement of the proposed components (segmentation, global alignment, and attribute alignment) over this baseline. From the table, we make the following observations. First, using the segmentation loss alone brings only marginal improvement because the visual features are not aligned with their corresponding textual features. We observe the same trend when training with only the attribute-alignment loss, where the visual features are not properly segmented and thus cannot be associated for retrieval; an incremental gain is obtained by combining the two. Next, compared with attribute-level alignment, global-level alignment greatly improves performance under all criteria, which demonstrates the effectiveness of the visual-textual alignment schema. The performance gap arises because the former learns attribute similarity across different person identities while the latter concentrates on the uniqueness of each person. Finally, combining all the loss terms yields the best performance, validating that global-alignment and attribute-alignment learning are complementary.

Table 3. The improvement from components added to the baseline model. Glb-Align and Attr-Align represent global-level and attribute-level alignment, respectively.
Fig. 4.

From left to right, we exhibit the raw input person images, the attribute labels generated by the pre-trained HRNet, the attribute segmentation results from our segmentation layer, and the corresponding feature maps from the local branches.

Visual Attribute Segmentation and Representations. In Fig. 4, we visualize the segmentation maps from the segmentation layer and the feature representations of the local branches. Even though the knowledge is transferred to only a lightweight structure, the auxiliary segmentation layer evidently produces accurate pixel-wise labels under different human poses. This suggests that person parsing knowledge has been successfully distilled into our local branches, which is crucial for precise cross-modal alignment learning. On the right side of Fig. 4, we show the feature map of each local branch per attribute.

Fig. 5.

(a) R@1 and R@10 results across different values of K in the proposed surrogate positive data sampling method. (b) Examples of surrogate positive data from different person identities.

Fig. 6.

Examples of person search results on CUHK-PEDES. We indicate the true/false matching results in boxes. (Color figure online)

K-Reciprocal Sampling. We investigate how the value of K impacts the pair-based sampling and learning process by evaluating R@1 and R@10 under different K settings in Fig. 5(a). Ideally, the larger K is, the more potential surrogate positive samples will be mined, but this also increases the chance that non-relevant (false positive) examples are incorrectly sampled. The results in Fig. 5(a) agree with this analysis: the best R@1 and R@10 are achieved when K is set to 8, and performance declines steadily as K grows larger. In Fig. 5(b), we provide visual examples of the surrogate positive pairs mined by our sampling method. These visual attributes from different persons serve as valuable positive samples in our alignment learning schema.

Qualitative Analysis. We present qualitative person retrieval results for a more in-depth examination. As shown in Fig. 6, we illustrate the top-10 matching results for each query. In the successful case (top), ViTAA precisely captures all attributes of the target person. It is worth noting that the wrong answers still capture the relevant attributes: “sweater with black, gray and white stripes”, “tan pants”, and “carrying a bag”. In the failure case (bottom), although the retrieved images are incorrect, almost all of them contain every attribute described in the query.

4.4 Extension: Attribute Retrieval

To validate the ability to associate visual attributes with textual phrases, we further conduct attribute retrieval experiments on the Market-1501  [59] and DukeMTMC  [34] datasets, where 27 and 23 human-related attributes, respectively, are annotated per image by  [29]. We use our ViTAA pre-trained on CUHK-PEDES without any further fine-tuning and perform retrieval using the attribute phrase as the query, reporting R@1 and mAP. We test only on the upper-body clothing attribute category and report the results in Table 4; further details of this experiment are given in the supplementary materials. Table 4 clearly shows that ViTAA achieves strong performance on almost all sub-attributes, which further supports our argument that ViTAA successfully associates visual attribute features with textual attribute descriptions.

Table 4. Upper-body clothing attribute retrieval results. Attr is short for attribute, and “upblack” denotes upper-body in black.

5 Conclusion

In this work, we present a novel ViTAA model that addresses the person search by natural language task from the perspective of attribute-specific alignment learning. In contrast to existing methods, ViTAA fully exploits the common attribute information in both visual and textual modalities across different person identities, and builds strong associations between visual attribute features and their corresponding textual phrases through our alignment learning schema. ViTAA achieves state-of-the-art results on the challenging CUHK-PEDES benchmark, demonstrating its promise for further advancing the person search by natural language domain.