
1 Introduction

Human faces are inherently linked to a person's identity and are therefore extensively used to recognize individuals. In forensic scenarios, there are many cases where law enforcement agencies have no photograph of a suspect, and only a facial sketch created with the help of an eyewitness or victim is available. Recently, automatic facial sketch recognition methods have attracted great attention due to their promising application in subject identification from police mug-shot databases. A facial sketch constitutes not proof of identity but only an approximation to it; nevertheless, using this approximation in sketch-photo recognition allows narrowing down the list of potential candidates or suspects.

The manual creation of facial sketches is a challenging task that depends on both the eyewitness and the specialist. Even so, humans can easily recognize facial sketches, despite the great differences between photos and sketches. Nonetheless, facial sketches are not easily recognized by standard automatic face recognition methods, owing to significant differences from facial photos. This problem has been defined in the literature as the modality gap [15]. The preferred approach to bridging the modality gap has been to transform facial photos into facial sketches so that matching is performed within the same image modality. However, this transformation may lose important discriminative information. We believe it can be avoided if a suitable feature representation is obtained that copes with the modality gap directly.

It was shown in [7] that software-generated composite sketches are more effective than hand-drawn sketches for automatic photo-sketch recognition. However, there are only a few works on this topic. To the best of our knowledge, the first work on using software-generated composites for face recognition was presented by Yuen and Man [16]. It used a combination of local and global features, and it required human intervention during the recognition phase. Other relevant works are those by Klare and Jain. Their first proposal [4] was inspired by the component-based manner in which composite sketches are created. Block-based multi-scale local binary patterns (MLBP) are extracted from facial regions corresponding to 76 facial landmarks, and the similarities between the same components in photos and sketches are obtained and combined by score fusion. The second one [6] is a holistic method that extracts SIFT and MLBP features from uniform regions across the face and learns an optimal subspace for each patch using linear discriminant analysis (LDA). The projected features are concatenated into a single vector, and these vectors are compared using the \(L^2\) distance. Their latest work [8] combines both strategies, introducing some modifications and parameter tuning that boost performance. More recently, the works by Mittal et al. [11,12,13] gradually improved the matching accuracy when recognizing composite sketches. Their most recent approach is based on deep learning and transfer learning [13]. First, a deep network is trained using 30,000 face photographs; next, the network is updated with information from composite sketch-photo pairs. This is a very interesting approach, since it compensates for the small number of sketches available for training.

Despite the promising results obtained by deep networks for this task, they are particularly difficult to train for our problem when no outside data is available and only a small number of photo-sketch pairs can be used. Another promising direction is discriminative or metric learning methods, which have shown significantly good performance in many problems, and we believe they are suitable to bridge the modality gap caused by the differences between mug-shots and sketches. Therefore, in this paper we focus on learning distances or discriminative representations on top of an intermediate representation based on quantized features, which does not require as much training data as deep networks. As the intermediate representation we propose densely sampled SIFT features (dense SIFT), quantized by a visual dictionary. This dictionary-based representation compensates for the geometric differences caused by the component-based manner in which composites are created, since it is moderately robust to image distortions. In addition, previous work has shown that SIFT features achieve some robustness against the modality gap in face-sketch recognition [5].

The remainder of the paper is organized as follows: Sect. 2 presents the main components of the proposed approach, the BoVW model and the discriminative representations. Experimental results are reported and discussed in Sect. 3. Concluding remarks are presented in Sect. 4.

2 Proposal: BoVW-Based Discriminative Representation

BoVW Representation. We use dense SIFT (DSIFT) descriptors as intermediate features for our Bag-of-Visual-Words (BoVW) representation [10]. Representing the image with dense SIFT avoids the interest-point detection step of standard SIFT, which is computationally expensive and provides sparse, potentially unreliable points. Instead, DSIFT samples the image on a dense grid with a user-defined spatial bin size, where the bins are located at a fixed scale and orientation, with a user-defined sampling step. Note that the DSIFT representation returns a bag of descriptors. A visual dictionary is created by k-means clustering, and the DSIFT features are vector quantized into K visual words [10]. Next, a histogram of assignment frequencies for each center is computed. This process is repeated for different Gaussian smoothings, from coarse to fine (e.g., using 4 different standard deviations), and different spatial bin sizes. Finally, the histograms are concatenated into a high-dimensional representation. Dense SIFT applied at several resolutions in this way is called Pyramid Histogram of Visual Words (PHOW). This representation is further reduced by principal component analysis (PCA) to a small number of dimensions (e.g., 20) in order to avoid overfitting and to make finding the discriminative projections practical. After the BoVW features are obtained, we create discriminative representations by means of metric learning or other methods that learn discriminative projections or similarities. A schematic view of our proposal is shown in Fig. 1, and a sketch of the feature extraction pipeline is given below.
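To make the pipeline concrete, the following is a minimal sketch in Python. Our actual implementation uses the VLFeat library (see Sect. 3); this version approximates multi-scale dense SIFT with OpenCV by computing SIFT descriptors on a fixed grid of keypoints, and the function and parameter names are illustrative, not part of the original implementation.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def dense_sift(img, step=3, sizes=(4, 6, 8, 10), magnif=6.0):
    """Multi-scale dense SIFT: descriptors on a fixed grid, one pass per bin size."""
    sift = cv2.SIFT_create()
    descs = []
    for size in sizes:
        # Coarse-to-fine Gaussian smoothing, sigma tied to the bin size (Sect. 3).
        smoothed = cv2.GaussianBlur(img, (0, 0), sigmaX=size / magnif)
        kps = [cv2.KeyPoint(float(x), float(y), float(size))
               for y in range(size, img.shape[0] - size, step)
               for x in range(size, img.shape[1] - size, step)]
        _, d = sift.compute(smoothed, kps)
        descs.append(d)
    return descs  # one bag of descriptors per scale

def bovw_histogram(descs_per_scale, kmeans):
    """Quantize descriptors with the dictionary and concatenate per-scale histograms."""
    hists = []
    for d in descs_per_scale:
        words = kmeans.predict(d)
        h = np.bincount(words, minlength=kmeans.n_clusters).astype(np.float64)
        hists.append(h / max(h.sum(), 1.0))  # assignment-frequency histogram
    return np.concatenate(hists)

def fit_pipeline(train_imgs, n_words=600, n_dims=20):
    """Fit dictionary and PCA on training images only (150x150 grayscale)."""
    all_descs = np.vstack([d for img in train_imgs for d in dense_sift(img)])
    kmeans = KMeans(n_clusters=n_words, n_init=4).fit(all_descs)
    feats = np.array([bovw_histogram(dense_sift(img), kmeans) for img in train_imgs])
    pca = PCA(n_components=n_dims).fit(feats)
    return kmeans, pca, pca.transform(feats)
```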

Fig. 1. Schematic view of our proposal

Although the discriminative methods follow the same principle, they rely on different criteria; therefore, we briefly describe the foundations of each. All the discriminative methods presented here take advantage of label information, either as class labels or as genuine/impostor labels as in a verification setting. We believe these methods help to emphasize the discriminative information needed to pull together sketches and mug-shots of the same class while pushing apart those of different classes.

LDA. Linear Discriminant Analysis (LDA) is one of the oldest methods for finding discriminative projections, and it is still widely used for its good performance. Here we use the Fisherfaces version of the method [1], which applies PCA before LDA for regularization. The method learns a projection that maximizes the between-class (inter-class) scatter over the within-class (intra-class) scatter. Let the between-class scatter be \(S_B=\sum _{i=1}^{C}N_i(\mu _i-\mu )(\mu _i-\mu )^T\) and the within-class scatter \(S_W=\sum _{i=1}^{C}\sum _{x_j\in X_i}(x_j-\mu _i)(x_j-\mu _i)^T\), where \(\mu \) is the mean of all objects in the dataset, \(\mu _i\) is the mean of class i, C is the number of classes, and \(N_i\) is the number of objects in class \(X_i\). The optimal projection is:

$$\begin{aligned} \hat{W} = \arg \max _{W}\frac{|W^T S_B W|}{|W^T S_W W|} =[w_1\, w_2\, \ldots \, w_M], \end{aligned}$$
(1)

where \(w_i,\ i=1,\ldots ,M\) are the generalized eigenvectors of \(S_B w_i = \lambda _i S_W w_i\). There are at most \(C-1\) such eigenvectors; consequently, the linear projections they generate have dimension at most \(C-1\).
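A minimal sketch of this recipe, assuming the PCA-reduced features from the previous step are given (the regularization details of the Fisherfaces method [1] are simplified to a small ridge term):

```python
import numpy as np
from scipy.linalg import eigh

def fisher_lda(X, y, n_components=None):
    """LDA projection from Eq. (1): generalized eigenvectors of (S_B, S_W).

    X: (n_samples, n_features) PCA-reduced features; y: integer class labels.
    """
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
    S_W += 1e-6 * np.eye(d)  # small ridge keeps S_W positive definite
    # Generalized symmetric eigenproblem: S_B w = lambda S_W w.
    evals, evecs = eigh(S_B, S_W)
    order = np.argsort(evals)[::-1]
    m = n_components or (len(classes) - 1)  # at most C-1 useful directions
    return evecs[:, order[:m]]              # columns w_1 ... w_M

# Usage: W = fisher_lda(feats_train, labels); projected = feats @ W
```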

KISS Metric Learning (KISSME). The idea of metric learning methods in general is to learn a Mahalanobis distance of the form \((x-y)^T M(x-y)\), where M is a weight matrix to be learned. This distance in the original space is equivalent to the squared Euclidean distance in a discriminative space where the linear projections of the data are given by \(\tilde{x}=Lx\), with L related to M by \(M = L^T L\). Note that the squared Euclidean distance in the original space is recovered by taking M as the identity matrix. The KISSME method models the commonalities of genuine and impostor pairs [9]. From a statistical inference point of view, the optimal statistical decision about the similarity of a pair \((x, y)\) can be obtained by:

$$\begin{aligned} r(x,y) = \log {\frac{P(x,y|H0)}{P(x,y|H1)}}, \end{aligned}$$
(2)

where we test the hypothesis H0 that the pair is similar against the alternative H1 that the pair is dissimilar. The method casts the problem in the difference space, which has zero mean. Therefore, using \(d_{xy} = x-y\) we have:

$$\begin{aligned} \delta (d_{xy}) = \log {\frac{P(d_{xy}|H0)}{P(d_{xy}|H1)}}. \end{aligned}$$
(3)

By assuming Gaussianity of the difference space, Eq. 3 is rewritten in terms of Gaussian distributions, and the parameters \(\theta _0\) and \(\theta _1\) of the probability density functions under the hypotheses H0 and H1 are estimated from the covariance matrices of the genuine \((\Sigma _{y_{ij}=1})\) and impostor \((\Sigma _{y_{ij}=0})\) pairs. The maximum likelihood estimate of each Gaussian is equivalent to minimizing the Mahalanobis distance from the mean in a least-squares sense. In this way, relevant directions for the genuine and impostor pair sets are found, and after reformulation and simplification, M is obtained by clipping the spectrum of the eigendecomposition of \(\hat{M} = \Sigma _{y_{ij}=1}^{-1}-\Sigma _{y_{ij}=0}^{-1}\) to ensure positive semi-definiteness.
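A minimal sketch of this estimate, assuming paired training features are given; the closing comment also shows how the \(M = L^T L\) factorization turns the learned metric back into a linear projection:

```python
import numpy as np

def kissme(X1, X2, same):
    """KISSME: M-hat = inv(Sigma_genuine) - inv(Sigma_impostor), clipped to PSD.

    X1, X2: (n_pairs, d) arrays of paired features;
    same: boolean array, True for genuine pairs.
    """
    d = X1 - X2  # zero-mean difference space
    sig_g = d[same].T @ d[same] / max(same.sum(), 1)       # genuine covariance
    sig_i = d[~same].T @ d[~same] / max((~same).sum(), 1)  # impostor covariance
    M = np.linalg.inv(sig_g) - np.linalg.inv(sig_i)
    # Clip the spectrum of the eigendecomposition to ensure M is PSD.
    evals, evecs = np.linalg.eigh(M)
    return evecs @ np.diag(np.clip(evals, 0.0, None)) @ evecs.T

def mahalanobis(M, x, y):
    """Learned distance (x - y)^T M (x - y)."""
    diff = x - y
    return diff @ M @ diff

# Since M = L^T L, the projection L of the text can be read off the clipped
# eigendecomposition: L = np.diag(np.sqrt(clipped_evals)) @ evecs.T, and the
# learned distance equals the squared Euclidean distance between L @ x and L @ y.
```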

Joint Bayesian. This method was proposed in [2] for face recognition and has been used extensively since then, especially on top of representations learned by convolutional networks. Joint Bayesian builds on the idea of the Bayesian face recognition method, but instead of modeling the differences in a 1D space, it models the joint distribution of the two samples \((x, y)\) before computing the distance between them. The difference-based formulation may reduce class separability, since the differences lie on a 1D line and may therefore overlap. In the Joint Bayesian model, assuming a face prior, each (zero-mean) face is the sum of two independent Gaussian latent variables, the identity variable and the intra-personal variation: \(x = I+E\). The joint formulation with the prior is also Gaussian with zero mean, and after some algebraic operations the log-likelihood ratio is obtained, which can be interpreted as a similarity measure.
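A minimal sketch of the resulting verification score, assuming the identity and intra-personal covariances \(S_I\) and \(S_E\) have already been estimated (e.g., by the EM procedure of [2], omitted here):

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_bayesian_score(x, y, S_I, S_E):
    """Log-likelihood ratio log P(x,y|same) - log P(x,y|diff) under x = I + E.

    S_I: covariance of the identity variable I; S_E: covariance of the
    intra-personal variation E; both (d, d). Faces are assumed zero-mean.
    """
    d = len(x)
    S = S_I + S_E  # covariance of a single face
    # Same identity: the pair shares I, so the cross-covariance block is S_I.
    cov_same = np.block([[S, S_I], [S_I, S]])
    # Different identities: independent faces, zero cross-covariance.
    cov_diff = np.block([[S, np.zeros((d, d))], [np.zeros((d, d)), S]])
    z = np.concatenate([x, y])
    return (multivariate_normal.logpdf(z, mean=np.zeros(2 * d), cov=cov_same)
            - multivariate_normal.logpdf(z, mean=np.zeros(2 * d), cov=cov_diff))
```

The closed form used in [2] is algebraically equivalent to this ratio of joint Gaussian densities; the explicit form above simply makes the shared-identity cross-covariance visible.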

Fig. 2. Images from the PRIP-VSGC database: each column shows the composites of the subject whose photo appears in the first row. Rows 2 to 4 correspond to composites created by (2) an American user, (3) Asian users, and (4) Identi-Kit.

3 Experimental Analysis

We present the results of the proposed discriminative representations for sketch recognition. For the implementation of the BoVW approach we used the VLFeat library [14], while the metric and similarity learning codes were taken from the authors' websites, and for LDA the PRTools library was used [3]. To evaluate the proposed discriminative representations for sketch recognition, we use the PRIP Viewed Software-Generated Composite (PRIP-VSGC) dataset [4] (see Fig. 2). It contains photographs of 123 subjects from the AR database, with three composites created for each subject using the FACES (American and Asian users) and Identi-Kit software. Both mug-shots and sketches are normalized to 150\(\,\times \,\)150 pixels. The parameters of the BoVW representation are: SIFT patch sizes of 4, 6, 8, and 10; standard deviations for the Gaussian blur of sigma = size/magnification factor, with a magnification factor of 6; and a sampling step of 3. The visual dictionary has 600 words, and PCA reduces the data to 20 dimensions for all datasets.

Table 1. Rank-10 identification accuracies for state-of-the-art methods and our methods for the different datasets

We compare our proposal with the most recent approach using deep learning and transfer learning [13], as well as the other state-of-the-art methods compared in that work, including a commercial off-the-shelf (COTS) system, FaceVACS. We replicate their experimental protocol, in which 48 photo-sketch pairs for training and 75 for testing are randomly selected five times. The average accuracies and standard deviations are shown in Table 1.

Table 2. Rank-20 and rank-40 identification accuracies on an extended gallery set of 2400 photos

From the results in Table 1, it can be seen that the proposed approaches outperform the other methods in the closed-set scenario, where only 75 images form the gallery. We also compare our methods in an open-set scenario, adding up to 2400 mug-shot images to the gallery set, in order to compare our proposal with [13] in terms of accuracies at ranks 20 and 40; a sketch of the rank-k evaluation is given below. The results are shown in Table 2. It can be seen that LDA also outperforms the method of [13] in this setting. Moreover, our methods obtain significantly higher accuracies at lower ranks, which can be very useful in real-world applications, where it is more convenient for the specialist to check a small list of potential candidates or suspects.
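For clarity, this is how rank-k identification accuracy over a (possibly extended) gallery is computed, assuming a matrix of probe-to-gallery distances; the function name is illustrative:

```python
import numpy as np

def rank_k_accuracy(dist, true_idx, k):
    """Fraction of probes whose mate is among the k closest gallery entries.

    dist: (n_probes, n_gallery) distance matrix (e.g., squared Euclidean in
    the learned discriminative space); true_idx: gallery index of each
    probe's mate.
    """
    order = np.argsort(dist, axis=1)  # gallery indices sorted by similarity
    ranks = np.argmax(order == np.asarray(true_idx)[:, None], axis=1)
    return float(np.mean(ranks < k))

# Extending the gallery with distractor mug-shots only adds columns to
# `dist`; the probes' true gallery indices stay the same.
```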

Our intuition behind the better performance of LDA when more distractors are added to the gallery is that LDA uses the most discriminative information of the three compared methods. Moreover, the preceding PCA projection to a very low-dimensional space provides a regularization that is highly beneficial for the generalization of the method to unseen data. KISSME receives the discriminative information in the form of genuine and impostor pairs, which for small training sets may not generalize as well as LDA, which receives the class memberships. However, for larger training sets KISSME might generalize better and avoid overfitting, since it does not receive information as specific as LDA does. The Joint Bayesian method estimates Gaussian distributions that may not be accurately estimated from small training sets, which can be problematic in the presence of distractors.

4 Conclusions

We proposed the use of discriminative representations for the problem of face-sketch recognition. The intermediate representation is obtained by means of multi-scale dense SIFT quantized with a visual dictionary, and metric learning or other discriminative methods are applied on top of it. The results obtained by the proposed approach are very competitive, matching or even surpassing deep learning-based methods. Moreover, the length of the vector representation used for classifying composites after PCA is only 20, which is remarkably small and very suitable for real-world applications. In fact, we found that using more dimensions for the composite sketches deteriorates the results, which points to the intrinsically low dimensionality of composite sketch representations. It must also be taken into account that the learning methods require a set of photo-sketch pairs. In our experiments we confirmed that a small set is sufficient for achieving good results, but a larger set should lead to better results.