
1 Introduction

Recently, we have witnessed numerous practical breakthroughs in person modeling related tasks, e.g., pedestrian detection  [2, 5, 47], pedestrian attribute recognition  [30, 40] and person re-identification  [13, 26, 59]. Person search  [15, 25], as an aggregation of the aforementioned tasks, has thus gained increasing research attention. Compared with searching by image queries, person search by natural language  [6, 24, 25, 52] makes the retrieval procedure more user-friendly and more flexible because it supports open-form natural language queries. Meanwhile, learning robust visual-textual associations becomes increasingly critical, which creates an urgent demand for a representation learning schema that can fully exploit both modalities.

Fig. 1.

In a case where two persons exhibit similar appearance attributes, it is hard to discriminate them merely by full-body appearance. Instead of matching the textual descriptions with the images at the global level, we decompose both image and text into attribute components and conduct a fine-grained matching strategy.

Relevant studies in person modeling point out the critical role of discriminative representations, especially of the local fragments in both image and text. For the former,  [38, 58] learn pose-related features from human keypoint maps, while  [20, 27] leverage body-part features through auxiliary segmentation-based supervision. For the latter,  [24, 25, 55] decompose complex sentences into noun phrases, and  [23, 52] directly adopt attribute-specific annotations to learn fine-grained attribute-related features. Attribute-specific features from image and text are thus all but requisite for the person search by natural language task, and how to effectively couple them remains an open question. We seek insight from a fundamental flaw that lingers in most current visual-language systems, illustrated in Fig. 1 and termed “malpositioned matching”. Tasks like textual grounding  [33, 35], VQA  [1], and image-text retrieval  [18, 36] measure similarities or mutual information across modalities in a holistic fashion by answering: do the feature vectors of image and text match each other? That way, when users input “a girl in white shirt and black skirt” as the retrieval query, the model is not able to distinguish the nuances of the two images shown in Fig. 1, where the false positive actually shows “black shirt and white skirt”. As both distinct color cues (“white” and “black”) exist in the images, overall matching without the ability to refer them to specific appearance attributes prevents the model from discriminating the two as needed. Such cases exist extensively in almost all cross-modal tasks and pose a challenge that a system can only tackle with the ability of fine-grained interplay between image and text.

Here, we put forward a novel Visual-Textual Attributes Alignment model (ViTAA). For feature extraction, we fully exploit both visual and textual attribute representations. Specifically, we leverage segmentation labels to drive attribute-aware feature learning from the input image. As shown in Fig. 3, we design multiple local branches, each of which is responsible for predicting the visual feature of one particular attribute. This process is guided by supervision from segmentation annotations, so the features are intrinsically aligned through the label information. We then use a generic natural language parser to extract attribute-related phrases, which at the same time removes the complex syntax of natural language and redundant, non-informative descriptions. Building upon this, we adopt a contrastive learning schema to learn a joint embedding space for both visual and textual attributes. Meanwhile, we also notice that common attributes may exist across different person identities (e.g., two different persons may wear a similar “black shirt”). To thoroughly exploit these cases during training, we propose a novel sampling method to mine surrogate positive examples, which largely enriches the sampling space and also provides valid, informative samples that help overcome the convergence problem in metric learning.

To this end, we argue and show that the benefits of attribute alignment to a person search model go well beyond the obvious. As the images used for person search tasks often exhibit large variance in appearance (e.g., varying poses or viewpoints, with/without occlusion, and cluttered backgrounds), the abstracted attribute-specific features naturally help to resolve ambiguity in the feature representations. Also, searching by appearance attributes innately brings interpretability to the retrieval task and enables attribute-specific retrieval. It is worth mentioning that a few very recent efforts attempt to utilize the local fragments in both visual and textual modalities  [8, 49] and hierarchically align them  [3, 6]. The pairing schema of visual features and textual phrases in these methods is based solely on the same identity, neglecting cues that exist across different identities. Compared with them, ours is a more comprehensive modeling method that fully exploits identical attributes from different persons and thus greatly helps the alignment learning.

To validate these claims, we conduct extensive experiments with our ViTAA model on the tasks of 1) person search by natural language and 2) person search by attribute, showing that the proposed model is capable of linking specific visual cues with specific words/phrases. More concretely, ViTAA achieves promising results across all these tasks. Further qualitative analysis verifies that our alignment learning successfully captures the fine-grained correspondence between visual and textual attributes. To summarize our contributions:

  • We design an attribute-aware representation learning framework to extract and align both visual and textual features for the task of person search by natural language. To the best of our knowledge, this is the first work to adopt both semantic segmentation and natural language parsing to facilitate semantically aligned representation learning.

  • We design a novel cross-modal alignment learning schema based on contrastive learning, which adaptively highlights informative samples during alignment learning. Meanwhile, an unsupervised data sampling method is proposed to facilitate the construction of learning pairs by exploiting surrogate positive samples across different person identities.

  • We demonstrate the superiority of ViTAA over other state-of-the-art methods on the person search by natural language task, and provide qualitative analyses showing the interpretability of ViTAA.

2 Related Work

Person Search. Depending on the form of the query, current person search tasks fall into two major thrusts: searching by images (termed person re-identification) and person search by textual descriptions. Typical person re-identification (re-id) methods  [13, 26, 59] are formulated as retrieving the candidate in the image gallery that shows the highest correlation with the query. However, a clear and valid image query is not always available in real scenarios, which largely impedes the application of re-id. Recently, researchers have turned their attention to re-id by textual descriptions: identifying the target person using free-form natural language  [3, 24, 25]. This setting comes with great challenges, as it requires the model to handle the complex syntax of long, free-form descriptive sentences and the inconsistent interpretations of low-quality surveillance images. To tackle these, methods such as  [4, 24, 25] employ attention mechanisms to build relation modules between visual and textual representations, while  [55, 60] propose cross-modal objective functions for joint embedding learning. Dense visual features are extracted in  [32] by cropping the input image to learn a region-level matching schema. Beyond this,  [19] introduces pose estimation information for fine-grained human body-part parsing.

Attribute Representations. Adopting appropriate feature representations is of crucial importance for learning and retrieving from both image and text. Previous efforts in person search by natural language unanimously use holistic features of the person, which omit the partial visual cues of attributes at the fine-grained level. Multiple re-id systems have focused on processing body-part regions for visual feature learning, which can be summarized as: hand-crafted horizontal stripes or grids  [26, 43, 46], attention mechanisms  [37, 45], and auxiliary information including keypoints  [41, 50], human parsing masks  [20, 27], and dense semantic estimation  [56]. Among these methods, the auxiliary information usually yields more accurate partitions when localizing human parts and facilitates body-part attribute representations, thanks to multi-task training or auxiliary networks. However, only a few works  [14] pay attention to accessories (such as backpacks), which could be potential contextual cues for accurate person retrieval. As the counterparts of specific visual cues, textual attribute phrases are usually provided as ground-truth labels or can be extracted from sentences by identifying noun phrases through sentence parsing. Many works use textual attributes as auxiliary label information to complement the content of image features  [23, 29, 39]. Recently, a few attempts leverage textual attributes as queries for person retrieval  [6, 52]:  [52] imposes an attribute-guided attention mechanism to capture the holistic appearance of a person, and  [6] proposes a hierarchical matching model that jointly learns global category-level and local attribute-level embeddings.

Visual-Semantic Embedding. Works in vision and language promote the notion of visual-semantic embedding, with the goal of learning a joint feature space for visual inputs and their corresponding textual annotations  [10, 53]. Such a mechanism plays a core role in a series of cross-modal tasks, e.g., image/video captioning  [7, 21, 51], image retrieval through natural language  [8, 49, 55], and visual question answering  [1]. The conventional joint embedding learning framework adopts a two-branch architecture  [8,9,10, 53, 55], where one branch extracts image features and the other encodes textual descriptions. The extracted cross-modal embedding features are learned through carefully designed objective functions.

3 Our Approach

Our network is composed of an image stream and a language stream (see Fig. 3), with the intention of encoding inputs from both modalities for visual-textual embedding learning. To be specific, given a person image \(\mathcal {I}\) and its textual description \(\mathcal {T}\), we first use the image stream to extract a global visual representation \(\textit{\textbf{v}}_0\) and a stack of local visual representations of \(N_{att}\) attributes \(\{\textit{\textbf{v}}_1, ... ,\textit{\textbf{v}}_{N_{att}}\}\), \(\textit{\textbf{v}}_i\in \mathbb {R}^{d}\). Similarly, we follow the language stream to extract the overall textual embedding \(\textit{\textbf{t}}_0\), then decompose the whole sentence into a list of attribute phrases using a standard natural language parser  [22] and encode them as \(\{\textit{\textbf{t}}_1, ... ,\textit{\textbf{t}}_{N_{att}}\}\), \(\textit{\textbf{t}}_i\in \mathbb {R}^d\). Our core contribution is the cross-modal alignment learning that matches each visual component \(\textit{\textbf{v}}_a\) with its corresponding textual phrase \(\textit{\textbf{t}}_a\), along with the global representation matching \(\big <\textit{\textbf{v}}_0, \textit{\textbf{t}}_0\big>\), for the person search by natural language task.
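As a concrete illustration, the PyTorch sketch below shows how a retrieval score could combine the global and attribute-level features produced by the two streams. The combination rule (an unweighted sum of the global cosine similarity and the mean attribute cosine similarity) is an assumption for illustration only, not necessarily the exact scoring used in our experiments.

```python
import torch.nn.functional as F

def match_score(v0, v_attr, t0, t_attr):
    """Score one image/description pair from the two-stream outputs.
    v0, t0: (d,) global features; v_attr, t_attr: (N_att, d) attribute features.
    The unweighted sum below is an illustrative combination rule."""
    s_glb = F.cosine_similarity(v0, t0, dim=-1)                  # global match
    s_att = F.cosine_similarity(v_attr, t_attr, dim=-1).mean()   # mean attribute match
    return s_glb + s_att
```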

3.1 The Image Stream

We adopt the sub-network of ResNet-50 (conv1, conv2_x, conv3_x, conv4_x)  [17] as the backbone to extract feature maps \(\textit{\textbf{F}}\) from the input image. Then, we introduce a global branch \(\mathcal {F}^{glb}\) and multiple local branches \(\mathcal {F}^{loc}_{a}\) to generate the global visual feature \(\textit{\textbf{v}}_{0} = \mathcal {F}^{glb}(\textit{\textbf{F}})\) and the attribute visual features \(\{\textit{\textbf{v}}_{1}\dots \textit{\textbf{v}}_{N_{att}}\}\), respectively, where \(\textit{\textbf{v}}_{a} = \mathcal {F}^{loc}_{a}(\textit{\textbf{F}})\). The network architectures are shown in Table 1. On top of all the local branches is an auxiliary segmentation layer that supervises each local branch to generate the segmentation map of one specific attribute category (shown in Fig. 3). Intuitively, we argue that this auxiliary task acts as a knowledge regulator that diversifies the local branches to produce attribute-specific features.

Our segmentation layer adopts the architecture of a lightweight MaskHead  [16] and can be removed during the inference phase to reduce computational cost. The remaining problem is that human parsing annotations are not readily available in person search datasets. To address this, we first train a human parsing network with HRNet  [42] as an off-the-shelf tool, where the HRNet is jointly trained on multiple human parsing datasets: MHPv2  [57], ATR  [28], and VIPeR  [44]. We then use its attribute category predictions as our segmentation annotations (illustrated in Fig. 2). With these annotations, the local branches receive the supervision needed from the segmentation task to learn attribute-specific features. Essentially, we distill the attribute information from a well-trained human parsing network to the lightweight segmentation layer through joint training.
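To make the image stream concrete, below is a minimal PyTorch sketch under several assumptions: the class name ImageStream, the placeholder conv blocks standing in for the Table 1 BasicBlock branches, the 256-d embedding, and the binary (attribute vs. background) segmentation heads are all illustrative choices, not the released implementation.

```python
import torch
import torch.nn as nn
import torchvision

class ImageStream(nn.Module):
    """Sketch of the image stream: a shared ResNet-50 backbone (conv1-conv4_x),
    one global branch, N_att local branches, and a lightweight per-attribute
    segmentation head used only during training."""

    def __init__(self, n_att=5, dim=256, n_seg_classes=2):
        super().__init__()
        resnet = torchvision.models.resnet50()
        self.backbone = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3)   # 1024-channel maps
        self.global_branch = self._branch(1024, dim)
        self.local_branches = nn.ModuleList(
            [self._branch(1024, dim) for _ in range(n_att)])
        # per-attribute segmentation heads: attribute vs. background (assumption)
        self.seg_heads = nn.ModuleList(
            [nn.Conv2d(dim, n_seg_classes, kernel_size=1) for _ in range(n_att)])
        self.pool = nn.AdaptiveAvgPool2d(1)

    @staticmethod
    def _branch(in_ch, dim):
        # placeholder conv block standing in for the Table 1 BasicBlocks
        return nn.Sequential(
            nn.Conv2d(in_ch, dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(dim), nn.ReLU(inplace=True))

    def forward(self, x, with_seg=False):
        f = self.backbone(x)                                    # (B, 1024, H, W)
        v0 = self.pool(self.global_branch(f)).flatten(1)        # (B, dim)
        maps = [branch(f) for branch in self.local_branches]    # attribute maps
        v_attr = torch.stack([self.pool(m).flatten(1) for m in maps], dim=1)
        if with_seg:                                            # training only
            seg = [head(m) for head, m in zip(self.seg_heads, maps)]
            return v0, v_attr, seg
        return v0, v_attr                                       # inference path
```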

Fig. 2.

Attribute annotations generated by the human parsing network. Torsos are labeled as background since there are no corresponding textual descriptions.

Discussion. Using attribute features has the following advantages over global features. 1) The textual annotations in the person search by natural language task describe a person mostly by dressing/body appearance, which attribute features fit perfectly. 2) Attribute alignment avoids the “malpositioned matching” cases shown in Fig. 1: using segmentation to regularize feature learning makes the model resilient to diverse human poses and viewpoints, and robust to background noise.

3.2 The Language Stream

Given the raw textual description, our language stream first parses and extracts noun phrases w.r.t. each attribute through the Stanford POS tagger  [31], and then feeds them into a language network to obtain sentence-level as well as phrase-level embeddings. We adopt a bi-directional LSTM to generate the global textual embedding \(\textit{\textbf{t}}_0\) and the local textual embeddings. Meanwhile, we adopt a dictionary clustering approach, as in  [8], to assign the parsed noun phrases to specific attribute categories. Concretely, we manually collect a list of words per attribute category, e.g., “shirt”, “jersey”, “polo” for the upper-body category, and use their average-pooled word vectors  [12] as the anchor embedding \(\mathbf {d}_a\), forming the dictionary \(\mathbf {D} = [\mathbf {d}_1, ... ,\mathbf {d}_{N_{att}}]\), where \(N_{att}\) is the total number of attributes. Building upon that, we assign each noun phrase to the category with the highest cosine similarity and form the local textual embeddings \(\{\textit{\textbf{t}}_1\dots \textit{\textbf{t}}_{N_{att}}\}\). Different from previous works such as  [32, 56], we also include accessory as one type of attribute, which serves as a crucial matching clue in many cases.
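The dictionary clustering step can be sketched as follows. The word lists, the word_vec lookup (e.g., a pre-trained GloVe/word2vec table), and the function names are illustrative assumptions; only the mechanism (average-pooled anchor embeddings and a highest-cosine-similarity assignment) follows the description above.

```python
import numpy as np

# Illustrative, partial word lists per attribute category; the actual lists
# used in our experiments are not reproduced here.
CATEGORY_WORDS = {
    "head":  ["hat", "glasses", "hair"],
    "upper": ["shirt", "jersey", "polo", "jacket"],
    "lower": ["pants", "skirt", "jeans", "shorts"],
    "shoes": ["shoes", "sneakers", "boots"],
    "bags":  ["backpack", "handbag", "bag"],
}

def build_dictionary(word_vec):
    """Anchor embedding d_a = average of the word vectors of each category."""
    cats = list(CATEGORY_WORDS)
    D = np.stack([np.mean([word_vec[w] for w in CATEGORY_WORDS[c]], axis=0)
                  for c in cats])
    D /= np.linalg.norm(D, axis=1, keepdims=True)       # unit-norm anchors
    return cats, D

def assign_phrase(phrase_vec, cats, D):
    """Assign a pooled noun-phrase vector to the most similar category."""
    p = phrase_vec / (np.linalg.norm(phrase_vec) + 1e-12)
    return cats[int(np.argmax(D @ p))]                  # highest cosine similarity wins
```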

Fig. 3.

Illustrative diagram of our ViTAA network, which includes an image stream (left) and a language stream (right). The image stream encodes the person image and extracts both global and attribute representations. The local branches are additionally supervised by an auxiliary segmentation layer whose annotations are acquired from an off-the-shelf human parsing network. Meanwhile, the textual description is parsed and decomposed into attribute atoms, which are encoded by a weight-sharing Bi-LSTM. We train ViTAA jointly under the global/attribute alignment losses in an end-to-end manner.

3.3 Visual-Textual Alignment Learning

Once we have extracted the global and attribute features, the key objective of the next stage is to learn a joint embedding space across the visual and textual modalities, in which visual cues are tightly matched with the given textual description. Mathematically, we formulate our learning objective as a contrastive learning task that takes triplets as input, i.e., \(\big <\textit{\textbf{v}}^i, \textit{\textbf{t}}^+, \textit{\textbf{t}}^-\big>\) and \(\big <\textit{\textbf{t}}^i, \textit{\textbf{v}}^+, \textit{\textbf{v}}^-\big>\), where i denotes the index of the person to identify, and \(+/-\) refer to the feature representations of person i and of a randomly sampled irrelevant person, respectively. We note that the features in a triplet can be either at the global level or at the attribute level. In the following, we discuss the learning schema on \(\big <\textit{\textbf{v}}^i, \textit{\textbf{t}}^+, \textit{\textbf{t}}^-\big>\), which extends directly to \(\big <\textit{\textbf{t}}^i, \textit{\textbf{v}}^+, \textit{\textbf{v}}^-\big>\).

We adopt the cosine similarity as the scoring function between visual and textual features, \(S= \frac{\textit{\textbf{v}}^T\cdot \textit{\textbf{t}}}{||\textit{\textbf{v}}||\cdot ||\textit{\textbf{t}}||}\). For a positive pair \(\big <\textit{\textbf{v}}^i, \textit{\textbf{t}}^+\big>\), the cosine similarity \(S^+\) is encouraged to be as large as possible, which we define as the absolute similarity criterion. For a negative pair \(\big <\textit{\textbf{v}}^i, \textit{\textbf{t}}^-\big>\), however, enforcing the cosine similarity \(S^-\) to be minimal would impose an arbitrary constraint on the negative samples \(\textit{\textbf{t}}^-\). Instead, we propose to enforce the deviation between \(S^+\) and \(S^-\) to be larger than a preset margin, called the relative similarity criterion. These criteria can be formulated as:

$$\begin{aligned} S^+\rightarrow 1 \quad \text {and}\quad S^+ - S^- > m, \end{aligned}$$
(1)

where m is the minimum margin by which the positive and negative similarities should differ, set to 0.2 in practice.

In contrastive learning, the basic objective function is generally either the hinge loss \(\mathcal {L}(\textit{\textbf{x}})=\max \{0, 1-\textit{\textbf{x}}\}\) or the logistic loss \(\mathcal {L}(\textit{\textbf{x}})=\log (1+\exp {(-\textit{\textbf{x}})})\). One crucial drawback of the hinge loss is that its derivative w.r.t. x is a constant whenever the loss is active: \(\frac{\partial \mathcal {L}}{\partial x}=-1\). Since the pair-based construction of training data leads to a polynomial growth in the number of training pairs, a certain portion of the randomly sampled negative texts will inevitably be less informative during training. Treating all these redundant samples equally raises the risk of slow convergence or even model degeneration in metric learning. In contrast, the derivative of the logistic loss w.r.t. x is \(\frac{\partial \mathcal {L}}{\partial x}=-\frac{1}{e^x+1}\), which depends on the input value. Hence, we settle on the logistic loss as our basic objective function.

With the logistic loss, the aforementioned criteria can be rewritten as:

$$\begin{aligned} (S^+-\alpha )> 0,\ -( S^- - \beta ) > 0, \end{aligned}$$
(2)

where \(\alpha \rightarrow 1\) denotes the lower bound for the positive similarity and \(\beta = (\alpha - m)\) denotes the upper bound for the negative similarity. Together with the logistic loss, our final alignment loss can be unrolled as:

$$\begin{aligned} \mathcal {L}_{align} = \frac{1}{N}\sum _{i=1}^{N}\Big \{ \log \left[ 1 + e^{-\tau _p(S_{i}^+ - \alpha )}\right] + \log \left[ 1 + e^{\tau _n(S_i^{-} - \beta )}\right] \Big \}, \end{aligned}$$
(3)

where \(\tau _p\) and \(\tau _n\) denote the temperature parameters that adjust the slope of the gradient. The partial derivatives are calculated as:

$$\begin{aligned} \frac{\partial \mathcal {L}_{align}}{\partial S_i^{+}} = \frac{-\tau _p}{1 + e^{\tau _p (S_i^{+} - \alpha )}}, \frac{\partial \mathcal {L}_{align}}{\partial S_i^{-}} = \frac{\tau _n}{1+e^{\tau _n (\beta - S_i^{-})}}. \end{aligned}$$
(4)

Thus, Eq. 3 produces continuous gradients and accordingly assigns higher weights to the more informative samples.
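For reference, a minimal PyTorch sketch of Eq. 3 for the \(\big <\textit{\textbf{v}}^i, \textit{\textbf{t}}^+, \textit{\textbf{t}}^-\big>\) direction is given below, using the hyperparameter values reported in Sec. 4.1; the batching of positives and negatives is simplified to one pair of each per anchor.

```python
import torch
import torch.nn.functional as F

def alignment_loss(v, t_pos, t_neg, alpha=0.6, beta=0.4, tau_p=10.0, tau_n=40.0):
    """Eq. 3 for the <v, t+, t-> direction; all inputs are (N, d) tensors.
    softplus(x) = log(1 + exp(x)) gives the two logistic terms."""
    s_pos = F.cosine_similarity(v, t_pos, dim=1)      # S+
    s_neg = F.cosine_similarity(v, t_neg, dim=1)      # S-
    loss = F.softplus(-tau_p * (s_pos - alpha)) + F.softplus(tau_n * (s_neg - beta))
    return loss.mean()
```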

K-Reciprocal Sampling. One premise of visual-textual alignment is to fully exploit informative positive and negative samples \(\textit{\textbf{t}}^+, \textit{\textbf{t}}^-\) to provide valid supervision. However, most current contrastive learning methods  [48, 54] construct positive pairs by selecting samples belonging to the same class and simply treat random samples from other classes as negatives. This is viable when only coarse, global information is used during training, but it may not handle cases like the one illustrated in Fig. 1, where a fine-grained comparison is needed, and it depends heavily on the average number of samples per attribute category to provide comprehensive positive samples. With this insight, we propose to enlarge the search space of positive samples with cross-identity instances.

For instance, in Fig. 1, although the two ladies have different identities, they wear nearly identical shoes, which can be treated as positive samples for learning. We term such samples, with identical attributes but belonging to different person identities, the “surrogate positive samples”. Including the common attribute features of surrogate positive samples in positive pairs makes far more sense than treating them as negatives. It is worth noting that this is unique to our attribute alignment learning phase, because attributes can only be compared at the fine-grained level. The key question then is: how can we dig out surrogate positive samples when we have no direct cross-identity attribute annotations? Inspired by the re-ranking techniques in the re-id community  [11, 61], we propose k-reciprocal sampling as an unsupervised method to generate surrogate labels at the attribute level. For each attribute a, we extract a batch of visual and textual features from the feature learning network and mine their corresponding surrogate positive samples with our sampling algorithm; since we only discuss the input form \(\big <\textit{\textbf{v}}^i, \textit{\textbf{t}}^+, \textit{\textbf{t}}^-\big>\), the algorithm in effect mines the surrogate positive textual features for each \(\textit{\textbf{v}}^i\). Note that if the attribute information in either modality is missing after parsing, we simply ignore it during sampling.
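One plausible reading of this sampling procedure is sketched below (the exact algorithm listing is omitted here, so details such as tie handling and batch composition are assumptions): for a given attribute, a text is kept as a surrogate positive of an image if each lies within the other's k nearest neighbors in the batch.

```python
import torch
import torch.nn.functional as F

def k_reciprocal_positives(v, t, k=8):
    """For one attribute category: v (Nv, d) visual and t (Nt, d) textual
    features from a batch. t_j is kept as a surrogate positive for v_i when
    j is among the k nearest texts of v_i AND i is among the k nearest
    images of t_j (k-reciprocal). Returns a (Nv, Nt) boolean mask."""
    v = F.normalize(v, dim=1)
    t = F.normalize(t, dim=1)
    sim = v @ t.t()                                              # cosine similarities
    # double argsort turns similarities into ranks (0 = most similar)
    rank_vt = sim.argsort(dim=1, descending=True).argsort(dim=1)
    rank_tv = sim.t().argsort(dim=1, descending=True).argsort(dim=1)
    return (rank_vt < k) & (rank_tv < k).t()                     # reciprocal agreement
```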


3.4 Joint Training

The entire network is trained in an end-to-end manner. We adopt the widely used cross-entropy loss (ID loss) to assist the learning of discriminative features for each instance, as well as a pixel-level cross-entropy loss (Seg loss) to classify the attribute categories in the auxiliary segmentation task. For cross-modal alignment learning, we apply the alignment loss to both the global-level and the attribute-level representations. The overall loss function is:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{id} + \mathcal {L}_{seg} + \mathcal {L}_{align}^{glo} + \mathcal {L}_{align}^{attr}. \end{aligned}$$
(5)
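A minimal sketch of how the four terms of Eq. 5 could be combined during training is shown below; aggregating the per-attribute segmentation maps by summing their pixel-level cross-entropy terms is an assumption.

```python
import torch.nn.functional as F

def total_loss(id_logits, id_labels, seg_logits, seg_labels, glo_align, attr_align):
    """Eq. 5: unweighted sum of ID, segmentation, and the two alignment terms.
    glo_align/attr_align are outputs of alignment_loss() above; seg_logits and
    seg_labels are per-attribute lists of predicted maps and target masks."""
    l_id = F.cross_entropy(id_logits, id_labels)
    l_seg = sum(F.cross_entropy(s, y) for s, y in zip(seg_logits, seg_labels))
    return l_id + l_seg + glo_align + attr_align
```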
Table 1. Detailed architecture of the global and local branches in the image stream. #Branch denotes the number of sub-branches.

4 Experiment

4.1 Experimental Setting

Datasets. We conduct experiments on the CUHK-PEDES  [25] dataset, which is currently the only benchmark for person search by natural language. It contains 40,206 images of 13,003 different persons, where each image comes with two human-annotated sentences. The dataset is split into 11,003 identities with 34,054 images in the training set, 1,000 identities with 3,078 images in the validation set, and 1,000 identities with 3,074 images in the test set.

Evaluation Protocols. Following the standard evaluation setting, we adopt Recall@K (K = 1, 5, 10) as the retrieval criterion. Specifically, given a text description as query, Recall@K (R@K) reports the percentage of queries for which at least one image of the corresponding person is retrieved among the top-K results.
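A short sketch of this protocol, assuming a precomputed query-gallery similarity matrix and integer person IDs:

```python
import numpy as np

def recall_at_k(sim, query_ids, gallery_ids, ks=(1, 5, 10)):
    """sim: (Q, G) similarity matrix between text queries and gallery images.
    A query counts as a hit at K if any of its top-K images shares its ID."""
    order = np.argsort(-sim, axis=1)                   # rank gallery by similarity
    hits = gallery_ids[order] == query_ids[:, None]    # (Q, G) boolean matches
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
```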

Implementation Details. For the global and local branches in the image stream, we use the BasicBlock described in  [17], with each branch randomly initialized (the detailed architecture is shown in Table 1). We use horizontal flipping for data augmentation and resize all images to \(384\times 128\). We use the Adam optimizer with weight decay \(4\times 10^{-5}\) and 64 image-language pairs per mini-batch. The learning rate is initialized to \(2\times 10^{-4}\) for the first 40 epochs of training, then decayed by a factor of 0.1 for the remaining 30 epochs. All experiments are conducted on a single Tesla V100 GPU. The hyperparameters in Eq. 3 are empirically set as: \(\alpha =0.6,\beta =0.4,\tau _p=10,\tau _n=40\).
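The optimizer and schedule described above can be reproduced with a few lines; the parameter grouping (a single group for all parameters) is an assumption.

```python
import torch

def build_optimizer(model):
    """Adam with weight decay 4e-5; lr 2e-4 for 40 epochs, then decayed by 0.1
    for the remaining 30 epochs, matching the settings stated above."""
    opt = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=4e-5)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[40], gamma=0.1)
    return opt, sched
```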

Pedestrian Attribute Parsing. Based on an analysis of the image and natural language annotations in the dataset, we group both visual and textual attributes into 5 categories: head (including descriptions related to hats, glasses, and face), clothes on the upper body, clothes on the lower body, shoes, and bags (including backpacks and handbags). We consider these attributes visually distinguishable in both modalities. In Fig. 2, we visualize the segmentation maps generated by our human parsing network, where attribute regions are properly segmented and associated with the correct labels.

4.2 Comparisons with the State-of-The-Arts

Results on the CUHK-PEDES Dataset. We summarize the performance of ViTAA and compare it with state-of-the-art methods on the CUHK-PEDES test set in Table 2. Methods such as GNA-RNN  [25], CMCE  [24], and PWM-ATH  [4] employ attention mechanisms to learn the relation between visual and textual representations, while Dual Path  [60] and CMPM+CMPC  [55] design objective functions for better joint embedding learning. These methods learn and utilize only the “global” feature representations of image and text. Moreover, MIA  [32] exploits “region” information by dividing the input image into several horizontal stripes and extracting noun phrases from the natural language description. Similarly, GALM  [19] leverages “keypoint” information from human pose estimation as an attention mechanism to assist feature learning, together with a noun phrase extractor applied to the input text. Although these two utilize local-level representations, neither learns the associations between visual features and textual phrases. From Table 2, we observe that ViTAA shows a consistent lead on all metrics (R@1-10), outperforming GALM  [19] by margins of 1.85%, 0.39%, and 0.55%, and setting new state-of-the-art results. We note that although the margins may appear incremental, the improvement on R@1 is particularly challenging to obtain, which suggests that the alignment learning of ViTAA contributes directly to the retrieval task. We further report ablation studies on the effect of different components, and present the attribute retrieval results quantitatively and qualitatively.

Table 2. Person search results on the CUHK-PEDES test set. Best results are in bold.

4.3 Ablation Study

We carry out comprehensive ablations to evaluate the contribution of different components and the training configurations.

Comparisons over Different Component Combinations. To compare the individual contribution of each component, we set the baseline as the model trained with only the ID loss. In Table 3, we report the improvement of the proposed components (segmentation, global alignment, and attribute alignment) over this baseline. From the table, we make the following observations. First, using the segmentation loss alone brings only marginal improvement because the visual features are not aligned with their corresponding textual features. We observe the same trend when training with only the attribute-alignment loss, where the visual features are not properly segmented and thus cannot be associated for retrieval; an incremental gain is obtained by combining the two. Next, compared with attribute-level alignment, global-level alignment greatly improves performance under all criteria, which demonstrates the effectiveness of the visual-textual alignment schema. The performance gap arises because the former learns attribute similarity across different person identities while the latter concentrates on the uniqueness of each person. Finally, combining all the loss terms yields the best performance, validating that global-alignment and attribute-alignment learning are complementary.

Table 3. The improvement from components added to the baseline model. Glb-Align and Attr-Align represent global-level and attribute-level alignment, respectively.
Fig. 4.

From left to right, we exhibit the raw input person images, the attribute labels generated by the pre-trained HRNet, the attribute segmentation results from our segmentation layer, and the corresponding feature maps from the local branches.

Visual Attribute Segmentation and Representations. In Fig. 4, we visualize the segmentation maps from the segmentation layer and the feature representations of the local branches. Even though the knowledge is transferred to only a lightweight structure, the auxiliary segmentation layer evidently produces accurate pixel-wise labels under different human poses. This suggests that person parsing knowledge has been successfully distilled into our local branches, which is crucial for precise cross-modal alignment learning. On the right side of Fig. 4, we show the feature map of each local branch per attribute.

Fig. 5.

(a) R@1 and R@10 results across different values of K in the proposed surrogate positive data sampling method. (b) Examples of surrogate positive data from different person identities.

Fig. 6.

Examples of person search results on CUHK-PEDES. We indicate the true/false matching results in boxes. (Color figure online)

K-Reciprocal Sampling. We investigate how the value of K impacts the pair-based sampling and learning process by evaluating R@1 and R@10 under different K settings in Fig. 5(a). Ideally, the larger K is, the more potential surrogate positive samples will be mined, but this also increases the chance that non-relevant (false positive) examples are incorrectly sampled. The results in Fig. 5(a) agree with this analysis: the best R@1 and R@10 are achieved when K is set to 8, and performance declines steadily as K grows larger. In Fig. 5(b), we provide visual examples of the surrogate positive pairs mined by our sampling method. These visual attributes from different persons serve as valuable positive samples in our alignment learning schema.

Qualitative Analysis. We present qualitative person retrieval results for a more in-depth examination. As shown in Fig. 6, we illustrate the top-10 matching results for each query. In the successful case (top), ViTAA precisely captures all attributes of the target person. It is worth noting that the wrong answers still capture the relevant attributes: “sweater with black, gray and white stripes”, “tan pants”, and “carrying a bag”. In the failure case (bottom), although the retrieved images are incorrect, almost all of them contain every attribute described in the query.

4.4 Extension: Attribute Retrieval

To validate the ability to associate visual attributes with textual phrases, we further conduct attribute retrieval experiments on the Market-1501  [59] and DukeMTMC  [34] datasets, where 27 and 23 human-related attributes, respectively, are annotated per image by  [29]. We use our ViTAA pre-trained on CUHK-PEDES without any further fine-tuning and perform retrieval using the attribute phrase as the query, reporting R@1 and mAP. We test only on the upper-body clothing attribute category and report the results in Table 4; further details of this experiment are given in the supplementary materials. Table 4 clearly shows that ViTAA achieves strong performance on almost all sub-attributes, which further supports our argument that ViTAA successfully associates visual attribute features with textual attribute descriptions.

Table 4. Upper-body clothing attribute retrieval results. Attr is short for attribute, and “upblack” denotes upper-body in black.

5 Conclusion

In this work, we present a novel ViTAA model that addresses the person search by natural language task from the perspective of attribute-specific alignment learning. In contrast to existing methods, ViTAA fully exploits the common attribute information in both visual and textual modalities across different person identities, and builds strong associations between visual attribute features and their corresponding textual phrases through our alignment learning schema. ViTAA achieves state-of-the-art results on the challenging CUHK-PEDES benchmark, demonstrating its promise for further advancing the person search by natural language domain.