
1 Introduction

Human faces are inherently linked to a person's identity and are therefore extensively used to recognize individuals. In forensic scenarios, there are many cases where law enforcement agencies have no photograph of a suspect, and only a facial sketch created with the help of an eyewitness or victim is available. Recently, automatic facial sketch recognition methods have attracted great attention due to their promising application in subject identification from police mug-shot databases. A facial sketch constitutes not proof of identity but only an approximation to it; nevertheless, using this approximation in sketch-photo recognition allows narrowing down the list of potential candidates or suspects.

The manual creation of facial sketches is a challenging task that depends on both the eyewitness and the specialist. Even so, humans can easily recognize facial sketches, despite the great differences between photos and sketches. Nonetheless, facial sketches are not easily recognized by standard automatic face recognition methods, owing to significant differences from facial photos. This problem has been defined in the literature as the modality gap [15]. The preferred approach to bridging the modality gap has been to transform facial photos into facial sketches so that matching is performed within the same image modality. However, this transformation may lose important discriminative information. We believe it can be avoided if a suitable feature representation is obtained that copes with the modality gap directly.

It was shown in [7] that software-generated composite sketches are more effective than hand-drawn sketches for automatic photo-sketch recognition. However, there are only a few works on this topic. To the best of our knowledge, the first work on using software-generated composites for face recognition was presented by Yuen and Man [16]. It used a combination of local and global features, and it required human intervention during the recognition phase. Other relevant works are those by Klare and Jain. Their first proposal [4] was inspired by the component-based manner in which composite sketches are created. Block-based multi-scale local binary patterns (MLBP) are extracted from facial regions corresponding to 76 facial landmarks, and the similarities between the same components in photos and sketches are obtained and combined by score fusion. The second one [6] is a holistic method that extracts SIFT and MLBP features from uniform regions across the face and learns an optimal subspace for each patch using linear discriminant analysis (LDA). The projected features are concatenated into a single vector, and these vectors are compared using the \(L^2\) distance. Their latest work [8] combines both strategies, introducing some modifications and parameter tuning that boost performance. More recently, the works by Mittal et al. [11,12,13] gradually improved the matching accuracy when recognizing composite sketches. Their most recent approach is based on deep learning and transfer learning [13]. First, a deep network is trained using 30,000 face photographs; next, the network is updated with information from composite sketch-photo pairs. This is a very interesting approach, since it compensates for the small number of sketches available for training.

Despite the promising results obtained by deep networks for this task, they are particularly difficult to train for our problem when no outside data is available and only a small number of photo-sketch pairs can be used. Another promising direction is discriminative or metric learning methods, which have shown significantly good performance in many problems, and we believe they are suitable to bridge the modality gap caused by the differences between mug-shots and sketches. Therefore, in this paper we focus on learning distances or discriminative representations on top of an intermediate representation based on quantized features, which does not require as much training data as deep networks. As the intermediate representation we propose densely sampled SIFT features (dense SIFT), quantized by a visual dictionary. This dictionary-based representation compensates for the geometric differences caused by the component-based manner in which composites are created, since it is moderately robust to image distortions. In addition, previous work has shown that SIFT features achieve some robustness against the modality gap in face-sketch recognition [5].

The remainder of the paper is organized as follows: Sect. 2 presents the main components of the proposed approach, the BoVW model and the discriminative representations. Experimental results are reported and discussed in Sect. 3. Concluding remarks are presented in Sect. 4.

2 Proposal: BoVW-Based Discriminative Representation

BoVW Representation. We use dense SIFT (DSIFT) descriptors as intermediate features for our Bag-of-Visual-Words (BoVW) representation [10]. Representing the image with dense SIFT avoids the interest-point detection step of standard SIFT, which is computationally expensive and provides sparse, potentially unreliable points. Instead, DSIFT samples the image on a dense grid with a user-defined spatial bin size, where the bins are located at a fixed scale and orientation, with a user-defined sampling step. Note that the DSIFT representation returns a bag of descriptors. A visual dictionary is created by k-means clustering, and the DSIFT features are vector quantized into K visual words [10]. Next, a histogram of assignment frequencies for each center is computed. This process is repeated for different Gaussian smoothings, from coarse to fine (e.g., using 4 different standard deviations), and different spatial bin sizes. Finally, the histograms are concatenated into a high-dimensional representation. Dense SIFT applied at several resolutions in this way is called Pyramid Histogram of Visual Words (PHOW). This representation is further reduced by principal component analysis (PCA) to a small number of dimensions (e.g., 20) in order to avoid overfitting and to make finding the discriminative projections practical. After the BoVW features are obtained, we create discriminative representations by means of metric learning or other methods that learn discriminative projections or similarities. A schematic view of our proposal is shown in Fig. 1, and a sketch of the feature extraction pipeline is given below.
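To make the pipeline concrete, the following is a minimal sketch in Python. Our actual implementation uses the VLFeat library (see Sect. 3); this version approximates multi-scale dense SIFT with OpenCV by computing SIFT descriptors on a fixed grid of keypoints, and the function and parameter names are illustrative, not part of the original implementation.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def dense_sift(img, step=3, sizes=(4, 6, 8, 10), magnif=6.0):
    """Multi-scale dense SIFT: descriptors on a fixed grid, one pass per bin size."""
    sift = cv2.SIFT_create()
    descs = []
    for size in sizes:
        # Coarse-to-fine Gaussian smoothing, sigma tied to the bin size (Sect. 3).
        smoothed = cv2.GaussianBlur(img, (0, 0), sigmaX=size / magnif)
        kps = [cv2.KeyPoint(float(x), float(y), float(size))
               for y in range(size, img.shape[0] - size, step)
               for x in range(size, img.shape[1] - size, step)]
        _, d = sift.compute(smoothed, kps)
        descs.append(d)
    return descs  # one bag of descriptors per scale

def bovw_histogram(descs_per_scale, kmeans):
    """Quantize descriptors with the dictionary and concatenate per-scale histograms."""
    hists = []
    for d in descs_per_scale:
        words = kmeans.predict(d)
        h = np.bincount(words, minlength=kmeans.n_clusters).astype(np.float64)
        hists.append(h / max(h.sum(), 1.0))  # assignment-frequency histogram
    return np.concatenate(hists)

def fit_pipeline(train_imgs, n_words=600, n_dims=20):
    """Fit dictionary and PCA on training images only (150x150 grayscale)."""
    all_descs = np.vstack([d for img in train_imgs for d in dense_sift(img)])
    kmeans = KMeans(n_clusters=n_words, n_init=4).fit(all_descs)
    feats = np.array([bovw_histogram(dense_sift(img), kmeans) for img in train_imgs])
    pca = PCA(n_components=n_dims).fit(feats)
    return kmeans, pca, pca.transform(feats)
```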

Fig. 1. Schematic view of our proposal

Although the discriminative methods follow the same principle, they rely on different criteria; therefore, we briefly describe the foundations of each. All the discriminative methods presented here take advantage of label information, either as class labels or as genuine/impostor labels as in a verification setting. We believe these methods help to emphasize the discriminative information needed to pull together sketches and mug-shots of the same class while pushing apart those of different classes.

LDA. Linear Discriminant Analysis (LDA) is one of the oldest methods for finding discriminative projections, and it is still widely used for its good performance. Here we use the Fisherfaces version of the method [1], which applies PCA before LDA for regularization. The method learns a projection that maximizes the between-class (inter-class) scatter over the within-class (intra-class) scatter. Let the between-class scatter be \(S_B=\sum _{i=1}^{C}N_i(\mu _i-\mu )(\mu _i-\mu )^T\) and the within-class scatter \(S_W=\sum _{i=1}^{C}\sum _{x_j\in X_i}(x_j-\mu _i)(x_j-\mu _i)^T\), where \(\mu \) is the mean of all objects in the dataset, \(\mu _i\) is the mean of class i, C is the number of classes, and \(N_i\) is the number of objects in class \(X_i\). The optimal projection is:

$$\begin{aligned} \hat{W} = \arg \max _{W}\frac{|W^T S_B W|}{|W^T S_W W|} =[w_1\, w_2\, \ldots \, w_M], \end{aligned}$$
(1)

where \(w_i,\ i=1,\ldots ,M\) are the generalized eigenvectors of \(S_B w_i = \lambda _i S_W w_i\). There are at most \(C-1\) such eigenvectors; consequently, the linear projections they generate have dimension at most \(C-1\).
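A minimal sketch of this recipe, assuming the PCA-reduced features from the previous step are given (the regularization details of the Fisherfaces method [1] are simplified to a small ridge term):

```python
import numpy as np
from scipy.linalg import eigh

def fisher_lda(X, y, n_components=None):
    """LDA projection from Eq. (1): generalized eigenvectors of (S_B, S_W).

    X: (n_samples, n_features) PCA-reduced features; y: integer class labels.
    """
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
    S_W += 1e-6 * np.eye(d)  # small ridge keeps S_W positive definite
    # Generalized symmetric eigenproblem: S_B w = lambda S_W w.
    evals, evecs = eigh(S_B, S_W)
    order = np.argsort(evals)[::-1]
    m = n_components or (len(classes) - 1)  # at most C-1 useful directions
    return evecs[:, order[:m]]              # columns w_1 ... w_M

# Usage: W = fisher_lda(feats_train, labels); projected = feats @ W
```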

KISS Metric Learning (KISSME). The idea of metric learning methods in general is to learn a Mahalanobis distance of the form \((x-y)^T M(x-y)\), where M is a weight matrix to be learned. This distance in the original space is equivalent to the squared Euclidean distance in a discriminative space where the linear projections of the data are given by \(\tilde{x}=Lx\), with L related to M by \(M = L^T L\). Note that the squared Euclidean distance in the original space is recovered by taking M as the identity matrix. The KISSME method models the commonalities of genuine and impostor pairs [9]. From a statistical inference point of view, the optimal statistical decision about the similarity of a pair \((x, y)\) can be obtained by:

$$\begin{aligned} r(x,y) = \log {\frac{P(x,y|H0)}{P(x,y|H1)}}, \end{aligned}$$
(2)

where we test the hypothesis H0 that the pair is similar against the alternative H1 that the pair is dissimilar. The method casts the problem in the difference space, which has zero mean. Therefore, using \(d_{xy} = x-y\) we have:

$$\begin{aligned} \delta (d_{xy}) = \log {\frac{P(d_{xy}|H0)}{P(d_{xy}|H1)}}. \end{aligned}$$
(3)

By assuming Gaussianity of the difference space, Eq. 3 is rewritten in terms of Gaussian distributions, and the parameters \(\theta _0\) and \(\theta _1\) of the probability density functions under the hypotheses H0 and H1 are estimated from the covariance matrices of the genuine \((\Sigma _{y_{ij}=1})\) and impostor \((\Sigma _{y_{ij}=0})\) pairs. The maximum likelihood estimate of each Gaussian is equivalent to minimizing the Mahalanobis distance from the mean in a least-squares sense. In this way, relevant directions for the genuine and impostor pair sets are found, and after reformulation and simplification, M is obtained by clipping the spectrum of the eigendecomposition of \(\hat{M} = \Sigma _{y_{ij}=1}^{-1}-\Sigma _{y_{ij}=0}^{-1}\) to ensure positive semi-definiteness.
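A minimal sketch of this estimate, assuming paired training features are given; the closing comment also shows how the \(M = L^T L\) factorization turns the learned metric back into a linear projection:

```python
import numpy as np

def kissme(X1, X2, same):
    """KISSME: M-hat = inv(Sigma_genuine) - inv(Sigma_impostor), clipped to PSD.

    X1, X2: (n_pairs, d) arrays of paired features;
    same: boolean array, True for genuine pairs.
    """
    d = X1 - X2  # zero-mean difference space
    sig_g = d[same].T @ d[same] / max(same.sum(), 1)       # genuine covariance
    sig_i = d[~same].T @ d[~same] / max((~same).sum(), 1)  # impostor covariance
    M = np.linalg.inv(sig_g) - np.linalg.inv(sig_i)
    # Clip the spectrum of the eigendecomposition to ensure M is PSD.
    evals, evecs = np.linalg.eigh(M)
    return evecs @ np.diag(np.clip(evals, 0.0, None)) @ evecs.T

def mahalanobis(M, x, y):
    """Learned distance (x - y)^T M (x - y)."""
    diff = x - y
    return diff @ M @ diff

# Since M = L^T L, the projection L of the text can be read off the clipped
# eigendecomposition: L = np.diag(np.sqrt(clipped_evals)) @ evecs.T, and the
# learned distance equals the squared Euclidean distance between L @ x and L @ y.
```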

Joint Bayesian. This method was proposed in [2] for face recognition and has been used extensively since then, especially on top of representations learned by convolutional networks. Joint Bayesian builds on the idea of the Bayesian face recognition method, but instead of modeling the differences in a 1D space, it models the joint distribution of the two samples \((x, y)\) before computing the distance between them. The difference-based formulation may reduce class separability, since the differences lie on a 1D line and may therefore overlap. In the Joint Bayesian model, assuming a face prior, each (zero-mean) face is the sum of two independent Gaussian latent variables, the identity variable and the intra-personal variation: \(x = I+E\). The joint formulation with the prior is also Gaussian with zero mean, and after some algebraic operations the log-likelihood ratio is obtained, which can be interpreted as a similarity measure.
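A minimal sketch of the resulting verification score, assuming the identity and intra-personal covariances \(S_I\) and \(S_E\) have already been estimated (e.g., by the EM procedure of [2], omitted here):

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_bayesian_score(x, y, S_I, S_E):
    """Log-likelihood ratio log P(x,y|same) - log P(x,y|diff) under x = I + E.

    S_I: covariance of the identity variable I; S_E: covariance of the
    intra-personal variation E; both (d, d). Faces are assumed zero-mean.
    """
    d = len(x)
    S = S_I + S_E  # covariance of a single face
    # Same identity: the pair shares I, so the cross-covariance block is S_I.
    cov_same = np.block([[S, S_I], [S_I, S]])
    # Different identities: independent faces, zero cross-covariance.
    cov_diff = np.block([[S, np.zeros((d, d))], [np.zeros((d, d)), S]])
    z = np.concatenate([x, y])
    return (multivariate_normal.logpdf(z, mean=np.zeros(2 * d), cov=cov_same)
            - multivariate_normal.logpdf(z, mean=np.zeros(2 * d), cov=cov_diff))
```

The closed form used in [2] is algebraically equivalent to this ratio of joint Gaussian densities; the explicit form above simply makes the shared-identity cross-covariance visible.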

Fig. 2. Images from the PRIP-VSGC database: each column shows the composites of the subject whose photo appears in the first row. Rows 2 to 4 correspond to composites created by (2) an American user, (3) Asian users, and (4) Identi-Kit.

3 Experimental Analysis

We present the results of the proposed discriminative representations for sketch recognition. For the implementation of the BoVW approach we used the VLFeat library [14], while the metric and similarity learning codes were taken from the authors' websites, and for LDA the PRTools library was used [3]. To evaluate the proposed discriminative representations for sketch recognition, we use the PRIP Viewed Software-Generated Composite (PRIP-VSGC) dataset [4] (see Fig. 2). It contains photographs of 123 subjects from the AR database, with three composites created for each subject using the FACES (American and Asian users) and Identi-Kit software. Both mug-shots and sketches are normalized to 150\(\,\times \,\)150 pixels. The parameters of the BoVW representation are: SIFT patch sizes of 4, 6, 8, and 10; standard deviations for the Gaussian blur of sigma = size/magnification factor, with a magnification factor of 6; and a sampling step of 3. The visual dictionary has 600 words, and PCA reduces the data to 20 dimensions for all datasets.

Table 1. Rank-10 identification accuracies for state-of-the-art methods and our methods for the different datasets

We compare our proposal with the most recent approach using deep learning and transfer learning [13], as well as the other state-of-the-art methods compared in that work, including a commercial off-the-shelf (COTS) system, FaceVACS. We replicate their experimental protocol, in which 48 photo-sketch pairs for training and 75 for testing are randomly selected five times. The average accuracies and standard deviations are shown in Table 1.

Table 2. Rank-20 and rank-40 identification accuracies on an extended gallery set of 2400 photos

From the results in Table 1, it can be seen that the proposed approaches outperform the other methods in the closed-set scenario, where only 75 images form the gallery. We also compare our methods in an open-set scenario, adding up to 2400 mug-shot images to the gallery set, in order to compare our proposal with [13] in terms of accuracies at ranks 20 and 40; a sketch of the rank-k evaluation is given below. The results are shown in Table 2. It can be seen that LDA also outperforms the method of [13] in this setting. Moreover, our methods obtain significantly higher accuracies at lower ranks, which can be very useful in real-world applications, where it is more convenient for the specialist to check a small list of potential candidates or suspects.
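For clarity, this is how rank-k identification accuracy over a (possibly extended) gallery is computed, assuming a matrix of probe-to-gallery distances; the function name is illustrative:

```python
import numpy as np

def rank_k_accuracy(dist, true_idx, k):
    """Fraction of probes whose mate is among the k closest gallery entries.

    dist: (n_probes, n_gallery) distance matrix (e.g., squared Euclidean in
    the learned discriminative space); true_idx: gallery index of each
    probe's mate.
    """
    order = np.argsort(dist, axis=1)  # gallery indices sorted by similarity
    ranks = np.argmax(order == np.asarray(true_idx)[:, None], axis=1)
    return float(np.mean(ranks < k))

# Extending the gallery with distractor mug-shots only adds columns to
# `dist`; the probes' true gallery indices stay the same.
```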

Our intuition behind the better performance of LDA when more distractors are added to the gallery is that LDA uses the most discriminative information of the three compared methods. Moreover, the preceding PCA projection to a very low-dimensional space provides a regularization that is highly beneficial for the generalization of the method to unseen data. KISSME receives the discriminative information in the form of genuine and impostor pairs, which for small training sets may not generalize as well as LDA, which receives the class memberships. However, for larger training sets KISSME might generalize better and avoid overfitting, since it does not receive information as specific as LDA does. The Joint Bayesian method estimates Gaussian distributions that may not be accurately estimated from small training sets, which can be problematic in the presence of distractors.

4 Conclusions

We proposed the use of discriminative representations for the problem of face-sketch recognition. The intermediate representation is obtained by means of multi-scale dense SIFT quantized with a visual dictionary, and metric learning or other discriminative methods are applied on top of it. The results obtained by the proposed approach are very competitive, matching or even surpassing deep learning-based methods. Moreover, the length of the vector representation used for classifying composites after PCA is only 20, which is remarkably small and very suitable for real-world applications. In fact, we found that using more dimensions for the composite sketches deteriorates the results, which points to the intrinsically low dimensionality of composite sketch representations. It must also be taken into account that the learning methods require a set of photo-sketch pairs. In our experiments we confirmed that a small set is sufficient for achieving good results, but a larger set should lead to better results.