
1 Introduction

In recent years, a surge of research on image generation has found applications in various real-world computer vision and multimedia tasks, e.g. facial attribute editing and animation [2, 13, 19], image super-resolution [4], object detection [3], and image-to-image translation [9, 10, 12]. Among them, some recent works focus on generating pedestrian images conditioned on pose information [16, 21, 26, 27] or from scratch [23]. The generated pedestrian images can be used to boost the performance of related learning tasks, such as person attribute prediction and person re-ID, through data augmentation of pedestrian image datasets. However, compared with general image generation, pedestrian image generation is more challenging due to complex body configurations and poses, abundant details, and varying lighting and backgrounds.

Fig. 1. The image generation examples of the APIE-Net.

The main drawbacks of the aforementioned pedestrian image generation approaches are threefold: body distortion, inadequate visual details, and large vague areas. First, body distortion occurs due to ambiguous locations of different body parts; for example, the generated regions may wrongly drift onto other body parts or even the background. Second, while obtaining the global structure of a pedestrian is relatively easy for most generation models, generating rich local details is problematic given the limited input information. Although [16, 21, 26, 27] exploit pose annotations in person generation, the appearance details are still inadequate. Third, large vague areas appear in the generated person images because of the conflict between disparate appearances caused by unsettled viewpoints and/or complex background scenarios. Neither Gaussian noise nor pose labels provide enough information to address this problem well.

To overcome the above drawbacks, we propose to introduce visual attributes into pedestrian generation. Visual attributes have characteristics that can greatly benefit image generation: they usually describe specific regions, e.g. “upper-body black” refers to the pixels of the upper torso; they can incorporate rich details, e.g. “lower-body in white skirt”; and they are relatively consistent, invariant to viewpoints and backgrounds. The pose information adopted in [16, 21, 26, 27] can be considered a kind of visual attribute. (For concise presentation we use the term “attribute” for “visual attribute” hereafter.) In this paper, we focus on more general attribute-based pedestrian generation. Given a specified attribute, we aim to re-generate (or edit) the image of a given pedestrian so that it exhibits this attribute. We thus call our method Attribute-aware Pedestrian Image Editing (APIE). Although attributes have been used in face editing, to our knowledge this is the first study of attribute-based pedestrian image editing.

Fig. 2. The main architecture of the APIE-Net.

Though some efforts have been made on attribute-based image generation [8,9,10,11,12,13, 29], the task is more difficult in the context of pedestrian editing due to the complexity of pedestrian images. Our proposed APIE-Net consists of three specific mechanisms to address these difficulties. First, to avoid dramatic distortions of the pedestrian appearance, we apply a segmentation network before image generation to locate the region of interest, so that generation is concentrated on the target body part and the other regions, including the background, remain unaffected. Second, to generate more visual details, a set of multi-scale adversarial discriminators is adopted to capture local visual information at different granularities. Third, to disentangle attributes that blend with each other or with other image content, we exploit adversarial training on the latent variables, as in the Fader-Net [14], to remove side effects from irrelevant attributes or image regions; e.g., for white striped clothes we want to retain the stripes while switching the color.

To verify the effectiveness of our method, we utilize the attribute labels provided in [28] and generate additional pedestrian images for two datasets, Market-1501 [30] and DukeMTMC-reID [22]. Some examples are depicted in Fig. 1. Comparisons with other generation methods [7, 14, 19] show that our method brings both quantitative and qualitative improvements. On the other hand, the generated images benefit other pedestrian-related visual tasks through data augmentation. Most pedestrian datasets are “attribute-imbalanced”: some attributes have few positive exemplars while others have many. The generated pedestrian images can augment the original datasets and make them more balanced, leading to state-of-the-art performance in both person re-ID and pedestrian attribute learning. This further demonstrates the effectiveness and practicality of our method.

2 Methodology

The main architecture of the APIE-Net, consisting of four parts, is shown in Fig. 2. At the input end, an attribute-aware segmentation network segments the input pedestrian image and extracts the region corresponding to the given attributes. The generator consists of an encoder-decoder network with skip connections and a latent-variable discriminator used to erase contradictory attributes. The multi-scale discriminators guarantee the realness of the generated images. The attribute classifier ensures that the generated images possess the assigned attributes.

Let \(\mathbf {x^a}\) denote an input pedestrian image \(\mathbf {x}\in \mathbb {R}^{W\times H\times C}\) (WHC denote the width, height and the number of channels respectively) with attributes \(\mathbf {a}\in \{0,1\}^l\) (l denotes the number of attribute categories). The \(t^{th}\) element of \(\mathbf {a}\) is 1 if \(\mathbf {x^a}\) has the \(t^{th}\) attribute (e.g., white clothes) and 0 otherwise. The overall target of APIE-Net is to generate a new pedestrian image \(\mathbf {x^b}\) given new attributes \(\mathbf {b}\) (e.g., blue clothes).

2.1 Attribute-Aware Segmentation Network

To locate the body part in which the attributes are to be changed, we resort to attribute-aware pedestrian segmentation. More specifically, we adopt a well-trained segmentation model [15, 17]. Based on the fine-grained segmentation results and the user-specified target, we obtain a mask \(\mathbf {m^a}\in \{0, 1\}^{W\times H}\), where pixels with value 1 denote the foreground area and 0 the background area. Note that pose information may also be utilized, as in [16, 21], to obtain the mask if the key points of pedestrian bodies are available. For simplicity of illustration, the learned mask mainly corresponds to three body parts: head, upper body, and lower body. The image region to be edited can then be expressed as \(\mathbf {p^a} = \mathbf {x^a} \odot \mathbf {m^a}\), e.g. the upper body of a pedestrian. The remaining part \(\mathbf {q^a} = \mathbf {x^a} \odot (\mathbf {1 - m^a})\) is considered as the background. \(\odot \) denotes element-wise multiplication across channels.

We apply the above process to a set of pedestrian images \(\mathbf {X^A}\) and obtain the editable and background regions represented as \(\mathbf {P^A} = \mathbf {X^A} \odot \mathbf {M^A}\) and \(\mathbf {Q^A} = \mathbf {X^A} \odot (\mathbf {1} - \mathbf {M^A})\).
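
For concreteness, the region split can be written as a minimal PyTorch-style sketch (the framework and the function name `extract_regions` are illustrative assumptions, not part of the original implementation):

```python
import torch

def extract_regions(x, mask):
    """Split a batch of pedestrian images into editable and background parts.

    x:    (N, C, H, W) input images X^A
    mask: (N, 1, H, W) binary masks M^A from the segmentation net
          (1 = region to edit, 0 = background)
    """
    p = x * mask          # editable region   P^A = X^A ⊙ M^A
    q = x * (1.0 - mask)  # background region Q^A = X^A ⊙ (1 - M^A)
    return p, q
```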

2.2 The Generator

Encoder-Decoder Network with Skip Connections. The generator consists of an encoder and a decoder, denoted as \(\mathbf {G}_{enc}\) and \(\mathbf {G}_{dec}\) respectively, as shown in Fig. 3. The encoder \(\mathbf {G}_{enc}\) projects the selected image region \(\mathbf {p^a}\) to the latent variable \(\mathbf {z^a}\) through several convolution layers. We then randomly flip one bit of the attribute vector \(\mathbf {a}\) to obtain the new attribute vector \(\mathbf {b}\), and check for conflicts to avoid irrational attribute combinations (e.g. a shirt cannot be white and red at the same time). At test time, \(\mathbf {b}\) is given by the user. The concatenation of \(\mathbf {b}\) and the latent variable \(\mathbf {z^a}\) is passed through the deconvolution layers of the decoder \(\mathbf {G}_{dec}\), which generates a new image region \(\mathbf {p^b}\). Similarly, decoding the concatenation of \(\mathbf {a}\) and \(\mathbf {z^a}\) outputs the reconstructed original region \(\mathbf {\tilde{p}^a}\), where “\(\tilde{\ }\)” distinguishes an estimate from the ground truth. Skip connections are adopted between corresponding convolution and deconvolution layers, making the encoder-decoder network a U-net. The above process can be expressed more formally as follows: \(\mathbf {Z^A} = \mathbf {G}_{enc}(\mathbf {P^A})\), \(\mathbf {\tilde{P}^A} = \mathbf {G}_{dec}(\mathbf {Z^A}, \mathbf {A})\), \(\mathbf {P^B} = \mathbf {G}_{dec}(\mathbf {Z^A}, \mathbf {B})\). The reconstructed regions should be as close to the original ones as possible, so the loss function for this encoder-decoder network is defined as:

$$\begin{aligned} \mathcal {L}_{rec} = {||\mathbf {P^A} - \mathbf {\tilde{P^A}} ||}^2. \end{aligned}$$
(1)
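
A hedged sketch of the generator forward pass and the reconstruction loss of Eq. (1) is given below; skip connections are omitted for brevity, PyTorch is assumed as the framework, and the helper `generator_step` is illustrative rather than the actual implementation:

```python
import torch
import torch.nn.functional as F

def generator_step(G_enc, G_dec, p_a, a, b):
    """Forward pass of the generator and reconstruction loss (Eq. 1).

    p_a: (N, C, H, W) editable region P^A
    a:   (N, l) original attribute vector
    b:   (N, l) target attribute vector (one flipped bit of a)
    """
    def cat_attr(z, attr):
        # tile the attribute vector spatially and concatenate with the latent map
        attr_map = attr[:, :, None, None].expand(-1, -1, z.size(2), z.size(3))
        return torch.cat([z, attr_map], dim=1)

    z_a = G_enc(p_a)                      # latent variable Z^A
    p_a_rec = G_dec(cat_attr(z_a, a))     # reconstruction  ~P^A
    p_b = G_dec(cat_attr(z_a, b))         # edited region    P^B
    loss_rec = F.mse_loss(p_a_rec, p_a)   # Eq. (1)
    return p_b, z_a, loss_rec
```
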
Fig. 3. The generator with skip-connections.

Discrimination Between Latent Variables. Some attributes of pedestrian images are entangled with each other, while others are mutually exclusive. For instance, the gender of a person is highly correlated with hair and clothing styles, while a solid-colored shirt cannot be “white” and “blue” simultaneously. Therefore, when we edit an image according to the attribute vector \(\mathbf {b}\), we have to account for attribute co-occurrence and contradiction. Instead of manually masking the attributes in \(\mathbf {b}\) that conflict with the latent variable, a more desirable approach is to automatically erase attribute information from the latent variable during encoding. To this end, we resort to adversarial learning between latent variables. More specifically, the generated \(\mathbf {p^b}\) is fed back into the encoder to obtain the latent variable \(\mathbf {z^b}\). A discriminator \(\mathbf {D}_Z\) is then trained to distinguish between the latent variables \(\mathbf {z^a}\) and \(\mathbf {z^b}\), which is formulated as follows:

$$\begin{aligned} \mathcal {L}_{{adv}_{D_Z}} = \mathbb {E}_{\mathbf {z^a}}[\mathbf {D}_Z(\mathbf {Z^A})]-\mathbb {E}_{{\mathbf {z^b}}}[\mathbf {D}_Z(\mathbf {Z^B})]. \end{aligned}$$
(2)

The adversarial min-max process is

$$\begin{aligned} \min _{\mathbf {G}_{enc}}{\max _{||\mathbf {D}_Z ||\le 1}}{\mathcal {L}_{{adv}_{D_Z}} }. \end{aligned}$$
(3)

The min-max game between \(\mathbf {G}_{enc}\) and \(\mathbf {D}_Z\) learns latent variables that are invariant to the attribute vectors. Once this invariance is achieved, the decoder must rely on the attribute vector to generate the image; in this way, attribute information is implicitly erased from the latent variables.
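
A possible realization of this min-max game (Eqs. (2)-(3)) is sketched below in assumed PyTorch style; detaching the latent codes on the critic side is a common implementation choice rather than something specified in the text:

```python
def latent_critic_loss(D_Z, z_a, z_b):
    """Critic side of Eq. (2): maximize D_Z(z^a) - D_Z(z^b).
    Returned negated so it can be minimized by gradient descent."""
    return -(D_Z(z_a.detach()).mean() - D_Z(z_b.detach()).mean())

def latent_encoder_loss(D_Z, z_a, z_b):
    """Encoder side of Eq. (3): minimize D_Z(z^a) - D_Z(z^b),
    which pushes z^b (the re-encoded generated region) toward z^a."""
    return D_Z(z_a).mean() - D_Z(z_b).mean()
```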

2.3 The Multi-scale Discriminators

We introduce another adversarial learning process to guarantee the realness of the generated image \(\mathbf {X^B} = \mathbf {P^B} + \mathbf {Q^A}\), which has the same size as the original image \(\mathbf {X^A}\). To capture abundant local visual details, we use two discriminators at different resolutions, as in [13]. As shown in Fig. 4, one discriminator, denoted as \(\mathbf {D}_1\), takes the generated images at the original resolution. The other, denoted as \(\mathbf {D}_2\), receives the generated images downsampled to half the resolution. Because of the unstable training of the original GAN model, we use the WGAN-GP formulation [18]. The losses for the discriminators are

$$\begin{aligned}&\mathcal {L}_{{adv}_{1}}= \mathbb {E}_\mathbf {x^a}[\mathbf {D}_1(\mathbf {X^A})]-\mathbb {E}_{\mathbf {x^b}}[\mathbf {D}_1(\mathbf {X^B})] - \lambda _{GP}\mathbb {E}_\mathbf {\hat{x}}[(\Vert \nabla _{\mathbf {\hat{x}}}\mathbf {D}_1(\mathbf {\hat{X}})\Vert _2 - 1)^2], \end{aligned}$$
(4)
$$\begin{aligned}&\mathcal {L}_{{adv}_{2}} = \mathbb {E}_\mathbf {x^a}[\mathbf {D}_2(\mathbf {X^A})]-\mathbb {E}_{\mathbf {x^b}}[\mathbf {D}_2(\mathbf {X^B})] - \lambda _{GP}\mathbb {E}_\mathbf {\hat{x}}[(\Vert \nabla _{\mathbf {\hat{x}}}\mathbf {D}_2(\mathbf {\hat{X}})\Vert _2 - 1)^2]. \end{aligned}$$
(5)

The overall loss of multi-scale discriminators is:

$$\begin{aligned} \mathcal {L}_{{adv}} = \mathcal {L}_{{adv}_{1}} + \mathcal {L}_{{adv}_{2}} \end{aligned}$$
(6)

The adversarial optimization process is:

$$\begin{aligned} \min _{\mathbf {G}}{\max _{||\mathbf {D}_* ||\le 1}{\mathcal {L}_{{adv}}}}, \end{aligned}$$
(7)

where \(\mathbf {D}_*(\mathbf {X^B})=\mathbf {D}_*(\mathbf {G}_{dec}(\mathbf {G}_{enc}(\mathbf {X^A} \odot \mathbf {M^A}), \mathbf {B})+\mathbf {X^A} \odot (\mathbf {1} - \mathbf {M^A}))\), \(\mathbf {\hat{X}}\) denotes samples interpolated between real and generated images as in [18], and \(\mathbf {G}\) denotes the generator parameters in both the encoder and the decoder. \(||\mathbf {D_*} ||\le 1\) is the 1-Lipschitz constraint, implemented by the gradient penalty terms above.
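
A sketch of the WGAN-GP critic objective of Eqs. (4)-(6) is shown below (PyTorch assumed; the losses are written in the form to be minimized, i.e. the negation of the maximized adversarial terms with the gradient penalty added, and `lambda_gp = 10` is the common WGAN-GP default rather than a value stated in the paper):

```python
import torch
import torch.nn.functional as F

def gradient_penalty(D, x_real, x_fake):
    """Penalty pushing ||∇D(x̂)|| toward 1 on samples interpolated
    between real and generated images (the X̂ of Eqs. 4-5)."""
    eps = torch.rand(x_real.size(0), 1, 1, 1, device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def multiscale_critic_loss(D1, D2, x_real, x_fake, lambda_gp=10.0):
    """Critic loss corresponding to Eq. (6); D2 sees half-resolution inputs."""
    x_fake = x_fake.detach()  # block gradients into the generator here
    x_real_half = F.avg_pool2d(x_real, 2)
    x_fake_half = F.avg_pool2d(x_fake, 2)
    l1 = (D1(x_fake).mean() - D1(x_real).mean()
          + lambda_gp * gradient_penalty(D1, x_real, x_fake))
    l2 = (D2(x_fake_half).mean() - D2(x_real_half).mean()
          + lambda_gp * gradient_penalty(D2, x_real_half, x_fake_half))
    return l1 + l2
```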

Fig. 4. The multi-scale discriminators.

2.4 The Attribute Classifier

The APIE-Net edits pedestrian images by transferring attributes, i.e., altering one attribute while keeping the others unchanged. One way to evaluate the model is to inspect the attributes of the generated images. We construct a classifier for this purpose, as in [14, 19]. The classifier takes the generated image \(\mathbf {x^b}\) as input and outputs the estimated attributes \(\mathbf {\tilde{b}}\). The cross-entropy loss between \(\mathbf {\tilde{B}}\) and the ground truth \(\mathbf {B}\) is then:

$$\begin{aligned} \mathcal {L}_{cls} = \sum _{\tilde{b}\in \mathbf {\tilde{B}},\, b\in \mathbf {B}}{\left[ -b\log \tilde{b} - (1-b)\log (1-\tilde{b})\right] }. \end{aligned}$$
(8)

It is noteworthy that the classifier shares model parameters with the discriminator \(\mathbf {D}_1\); attribute classification and image discrimination thus reinforce each other. Experiments will show the resulting improvements in the generated image details.
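
To illustrate the weight-sharing between \(\mathbf {D}_1\) and the attribute classifier, a hypothetical PyTorch module with a shared convolutional trunk and two heads is sketched below; the channel widths, kernel sizes and activations are assumptions (only the 5-layer depth and the \(128\times 64\) input size come from Sect. 3.1):

```python
import torch.nn as nn

class SharedDiscriminatorClassifier(nn.Module):
    """D_1 and the attribute classifier sharing convolutional features."""

    def __init__(self, in_ch=3, n_attr=27):
        super().__init__()
        chs = [in_ch, 64, 128, 256, 512, 1024]
        layers = []
        for cin, cout in zip(chs[:-1], chs[1:]):
            layers += [nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
        self.trunk = nn.Sequential(*layers)                # shared 5-layer trunk
        self.disc_head = nn.Conv2d(1024, 1, 3, padding=1)  # realness score (D_1)
        self.cls_head = nn.Conv2d(1024, n_attr, (4, 2))    # attribute logits

    def forward(self, x):
        h = self.trunk(x)                    # (N, 1024, 4, 2) for 128x64 inputs
        realness = self.disc_head(h).mean(dim=[1, 2, 3])
        attr_logits = self.cls_head(h).flatten(1)
        return realness, attr_logits
```

The attribute logits can then be compared with the target attributes \(\mathbf {b}\) via a binary cross-entropy loss, e.g. `F.binary_cross_entropy_with_logits(attr_logits, b)`, which matches Eq. (8) with the sigmoid folded into the loss.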

2.5 The Overall Loss

The generator, the multi-scale discriminators and the attribute classifier are trained simultaneously by optimizing the overall loss function:

$$\begin{aligned} \mathcal {L}(&\mathbf {C},\mathbf {G}_{enc},\mathbf {G}_{dec},\mathbf {D}_1,\mathbf {D}_2,\mathbf {D}_Z) \nonumber \\&= \mathcal {L}_{{adv}} + \lambda _C \mathcal {L}_{cls} + \lambda _R \mathcal {L}_{rec} +\lambda _{D_Z} \mathcal {L}_{{adv}_{D_Z}}, \end{aligned}$$
(9)

where \(\lambda _*\) denote hyper-parameters to balance the losses. Following the adversarial training process, the parameters can be obtained by the following min-max game:

$$\begin{aligned} \arg \min _{\mathbf {C},\mathbf {G}_{enc},\mathbf {G}_{dec}}{\max _{\mathbf {D}_1,\mathbf {D}_2,\mathbf {D}_Z}{\mathcal {L}(\mathbf {C},\mathbf {G}_{enc},\mathbf {G}_{dec},\mathbf {D}_1,\mathbf {D}_2,\mathbf {D}_Z)}}. \end{aligned}$$
(10)
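
Combining the pieces, the generator-side objective of Eq. (9) reduces to a weighted sum; a minimal sketch using the weights reported in Sect. 3.1 (the function name is illustrative) is:

```python
def generator_objective(adv_g, loss_cls, loss_rec, loss_latent,
                        lam_c=10.0, lam_r=100.0, lam_dz=0.1):
    """Generator-side part of Eq. (9): adversarial term plus weighted
    classification, reconstruction and latent-adversarial losses."""
    return adv_g + lam_c * loss_cls + lam_r * loss_rec + lam_dz * loss_latent
```

The discriminators \(\mathbf {D}_1\), \(\mathbf {D}_2\) and \(\mathbf {D}_Z\) are updated in alternation with the generator, each maximizing its respective term as in Eq. (10).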

3 Experiments

3.1 Datasets and Implementation

To verify the effectiveness of the proposed method, three experiments are conducted: pedestrian image generation, person re-ID, and attribute learning. All experiments are performed on the Market-1501 [30] and DukeMTMC-reID [22] datasets, with attribute labels provided by [28]. These two datasets were collected for the person re-ID task, with 12,936/16,622 training images belonging to 751/702 identities. The training images have a size of \(128\times 64\times 3\), as do the generated images. The datasets are labeled with 27/23 attributes, of which 8/8 describe the upper-body clothes and 9/7 the lower body.

Table 1. The IS of attribute transformation on Market-1501.
Table 2. The IS of attribute transformation on DukeMTMC-reID.

The encoder and decoder of the generator network each consist of 5 (de)convolution layers, as do the classifier and the multi-scale discriminators. The latent discriminator in the generator is a lightweight network with only two fully connected layers. The initial learning rate is 0.0002 and it is reduced to one-tenth after every 10,000 iterations. We set \(\lambda _R = 100, \lambda _C = 10, \lambda _{D_Z} = 0.1\). We use the Adam optimizer as in the WGAN-GP model. The batch size is 50, and the number of training epochs is 400.
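
A possible optimizer configuration matching these hyper-parameters is sketched below (PyTorch assumed; the Adam betas follow the common WGAN-GP setting and are an assumption, not a value given in the paper):

```python
import torch

def build_optimizers(gen_params, disc_params, lr=2e-4):
    opt_g = torch.optim.Adam(gen_params, lr=lr, betas=(0.5, 0.9))
    opt_d = torch.optim.Adam(disc_params, lr=lr, betas=(0.5, 0.9))
    # decay the learning rate to one-tenth every 10,000 iterations
    # (schedulers stepped once per training iteration)
    sch_g = torch.optim.lr_scheduler.StepLR(opt_g, step_size=10_000, gamma=0.1)
    sch_d = torch.optim.lr_scheduler.StepLR(opt_d, step_size=10_000, gamma=0.1)
    return opt_g, opt_d, sch_g, sch_d
```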

For the person re-ID experiments, we construct a baseline model by using ResNet50 [25] as the backbone and tuning the network following [20]. The input images are resized to \(256\times 128\), randomly cropped, and horizontally flipped during training, and no dropout is adopted. The batch size is set to 32. The learning rate is initialized to 0.0003 and decreased to one-tenth every 20 epochs. During training we augment the original datasets with an equal number of images generated by APIE-Net, which are supposed to have distinct identities. We use label smoothing regularization (LSR) as in [5] to balance the weights of the generated images. Mean average precision (mAP) and Cumulative Matching Characteristics (CMC) are the evaluation metrics. For attribute prediction, we use a similar baseline model except for two points: batch normalization is not adopted, and a multi-sigmoid loss is used instead of the softmax loss. mAP is used for evaluation.
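
As a rough illustration of how LSR can down-weight the generated images in the re-ID loss, the sketch below smooths the labels only for augmented samples; the smoothing factor and the exact scheme of [5] are assumptions here:

```python
import torch
import torch.nn.functional as F

def lsr_cross_entropy(logits, targets, is_generated, eps=0.1):
    """Cross-entropy with label smoothing applied to generated samples only.

    logits:       (N, num_ids) identity predictions
    targets:      (N,) ground-truth identity indices
    is_generated: (N,) bool mask, True for APIE-Net augmented images
    """
    n_cls = logits.size(1)
    log_p = F.log_softmax(logits, dim=1)
    one_hot = F.one_hot(targets, n_cls).float()
    smoothed = (1.0 - eps) * one_hot + eps / n_cls
    labels = torch.where(is_generated[:, None], smoothed, one_hot)
    return -(labels * log_p).sum(dim=1).mean()
```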

3.2 Pedestrian Image Editing

Pedestrian image editing generates new pedestrian images according to specified attributes. For ease of illustration, we select two identities from each dataset and change the color attributes of their clothes.

Some pedestrian examples generated by APIE-Net are illustrated in Fig. 1. For comparison, we select four widely used image editing methods: CycleGAN [7], Att-GAN [19], Fader-Net [14] and StarGAN [11]. Their model parameters are set as in the corresponding papers. Figure 5 shows the qualitative comparison. Though these methods perform well in facial attribute editing, they perform poorly in the pedestrian editing scenario. The APIE-Net outperforms all of the other methods, owing to its merits in image generation. First, while the target attributes are changed, most other details are maintained. Second, the transformed body part is precisely located. Third, the edited pedestrian images look natural with very few artifacts.

The quantitative comparison is listed in Tables 1 and 2. The Inception score (IS), which measures the diversity of the generated images, is used as the evaluation metric; a higher value indicates a better result. Our method achieves the highest IS on both datasets.

Fig. 5. The image generation examples of the APIE-Net.

3.3 Data Augmentation for Person Re-ID and Attribute Prediction

We further apply APIE-Net to augment the datasets for the person re-ID and attribute prediction tasks. Each augmented dataset is twice the size of the original one. We construct a strong baseline model for each task as explained above.

Table 3 compares APIE-Net with other methods that augment the datasets for person re-ID. The results in the upper part are the highest figures reported in the corresponding papers. In the lower part, the carefully tuned baseline model already performs better than these state-of-the-art methods. The other three methods (including ours) are built on the baseline with data augmentation. The trivial pure-color-filling method and StarGAN actually decrease the performance, indicating that inferior augmented images have a negative effect on the task. In contrast, our method generates images with accurate localization and sufficient details, and thus achieves the highest performance.

Table 3. Comparison on person re-ID with data augmentation.

The comparative study on attribute learning is reported in Table 4. We compare our method with [28, 31,32,33] with respect to “C.up” (the color of the upper-body clothes), “C.low” (the color of the lower-body clothes) and the mean accuracy, as these are the attributes we augment in the experiments. APIE-Net achieves an average gain of 0.3% over the baseline, and a much larger gain over the other methods on the Market-1501 dataset. APIE performs below [32, 33] on DukeMTMC-reID because of the lower baseline we use. The high performance is attributed to the well-generated images with accurate attribute information, which provides another verification of our method.

Table 4. The attribute learning experiment using the augmented data on Market-1501 and DukeMTMC-reID.

3.4 Ablation Study

The APIE-Net contains four components that are optional but useful: the attribute-aware segmentation net, the multi-scale discriminators (“M.D.”), the latent-variable discriminator (“L.D.”), and the discriminator-classifier weight-sharing mechanism (“S.W.”). The attribute-aware segmentation network and the generator-discriminator pair together constitute the baseline set-up; “w.o. mask” denotes the baseline without the attribute-aware segmentation net. The combination of “M.D.”, “L.D.” and “S.W.” constitutes the full APIE-Net. To verify the contribution of each component to the final performance, we conduct the following ablation study.

As shown in Table 5, the IS values generally increase whenever a new component is added, and the full combination outperforms every single component. We then visualize the image generation results in Fig. 6. The result of removing the attribute-aware segmentation net from the baseline is depicted in the second row; it is the least satisfying, as it performs the least transformation.

Table 5. The IS comparison between different schemes.
Fig. 6. Image generation results with different parts of APIE-Net.

4 Conclusion and Future Work

In this paper, we propose the problem of attribute-aware pedestrian image editing and a new model, APIE-Net, as its solution. With the aid of specified attributes, APIE-Net can generate high-quality pedestrian images, which benefit real-world applications including person re-ID and attribute prediction. Comprehensive experiments demonstrate the effectiveness of the proposed method. In the future we will experiment with more complex attributes in pedestrian image editing.