Elsevier

Pattern Recognition

Volume 69, September 2017, Pages 61-81

Dynamic ensembles of exemplar-SVMs for still-to-video face recognition

https://doi.org/10.1016/j.patcog.2017.04.014

Highlights

  • An efficient multi-classifier system is proposed for robust still-to-video FR.

  • Multiple diverse representations are generated from the single target still face.

  • Individual-specific ensembles of exemplar-SVMs are designed based on domain adaptation.

  • Different domain adaptation training schemes are proposed to generate the classifiers.

  • Dynamic classifier selection and weighting are applied to perform spatio-temporal FR.

Abstract

Face recognition (FR) plays an important role in video surveillance by allowing accurate recognition of individuals of interest over a distributed network of cameras. Systems for still-to-video FR are exposed to challenging operational environments. The appearance of faces changes when captured under unconstrained conditions due to variations in pose, scale, illumination, occlusion, blur, etc. Moreover, the facial models used for matching may not be robust to intra-class variations because they are typically designed a priori with one reference facial still per person. Indeed, faces captured during enrollment (using still cameras) may differ considerably from those captured during operations (using surveillance cameras). In this paper, an efficient multi-classifier system (MCS) is proposed for accurate still-to-video FR based on multiple face representations and domain adaptation (DA). An individual-specific ensemble of exemplar-SVM (e-SVM) classifiers is thereby designed to improve robustness to intra-class variations. During enrollment of a target individual, an ensemble is used to model the single reference still, where multiple face descriptors and random feature subspaces enable the generation of a diverse pool of patch-wise classifiers. To adapt these ensembles to the operational domains, e-SVMs are trained using labeled face patches extracted from the reference still versus patches extracted from cohort and other non-target stills, mixed with unlabeled patches extracted from the corresponding face trajectories captured with surveillance cameras. During operations, the most competent classifiers for a given probe face are dynamically selected and weighted based on internal criteria determined in the feature space of the e-SVMs. This paper also investigates the impact of different training schemes for DA, as well as of a validation set of non-target faces extracted from stills and video trajectories of unknown individuals in the operational domain.
The performance of the proposed system was validated using videos from the COX-S2V and Chokepoint datasets. Results indicate that the proposed system can surpass state-of-the-art accuracy, yet with significantly lower computational complexity. Indeed, dynamic selection and weighting make it possible to combine only the most relevant classifiers for each input probe.

Introduction

Face analysis and recognition are widely used in applications such as law enforcement, forensics, e-learning, biometric authentication, health monitoring, and surveillance. In decision support systems for video surveillance, recognizing the faces of target individuals is increasingly employed to enhance security in public places such as airports, subways, and shopping malls [1]. These systems must accurately detect the presence of the individuals of interest across a distributed network of video cameras based on their corresponding facial models. Still-to-video FR systems capture faces appearing in videos, and then match them against facial models generated from high-quality target face stills [2]. Spatio-temporal recognition and multi-view analysis are typically exploited to enhance performance in such applications [3].

In still-to-video FR, facial models are designed using one or more target facial regions of interest (ROIs) isolated in reference still images, either for template matching or for determining a set of classifier parameters [4]. Still-to-video FR systems are typically designed as independent individual-specific detectors, each one implemented with a template matcher, or a 1- or 2-class classification system, per individual of interest [5]. During enrollment, each detector may be modeled using reference still ROI(s) from target individuals, possibly still ROIs from the cohort or other non-target persons, as well as trajectories of video ROIs from unknown (non-target) individuals. The benefits of designing individual-specific detectors include the ability to add, update, and remove detectors from the system, as well as to select specialized feature subsets and decision thresholds for each individual [6].

Watch-list screening is challenging for still-to-video FR systems because the number of representative reference still ROIs (high-quality mugshots or ID photos) available during enrollment of a target individual is very limited [6]. It is typically too costly or infeasible to collect and analyze several reference ROIs. In particular, only one or a few still ROIs are available for enrollment of an individual, and only a restricted number of individuals (the cohort) are enrolled in the system. Furthermore, the appearance of ROIs captured from reference stills may differ significantly from ROIs captured in videos, and vary due to capture conditions (e.g., illumination, pose, scale, blur, expression, and occlusion) [7].

Given this single sample per person (SSPP) problem, state-of-the-art systems for still-to-video FR may achieve a low level of performance due to difficulties in designing robust facial models [8]. Different techniques specialized for SSPP problems have been proposed to improve robustness to intra-class variability, such as using multiple face representations, synthesizing virtual faces, and incorporating auxiliary sets to enlarge the training data [6], [9], [10]. However, multiple representations and synthetic generation techniques alone are only effective to the extent that reference target ROIs captured in the enrollment domain (ED) are representative of an operational domain (OD).

An important issue in still-to-video FR is that probe ROIs are captured over multiple distributed surveillance cameras, where each camera represents a different non-stationary OD. Capture conditions may vary dynamically within an OD according to environmental conditions and individual behaviors. Accordingly, their data distribution differs significantly from that of ROIs captured with a still camera in the ED, degrading system performance [11]. Designing a robust facial model for still-to-video FR is a challenging task due to the differences between faces captured in the ED and the OD [8].

Several transfer learning methods have been proposed to design accurate recognition systems that will perform well in the OD using knowledge taken from the ED [12]. Since the learning tasks and feature spaces of the ED and OD are the same, but their data probability distributions are different, watch-list screening corresponds to domain adaptation (DA) [13]. According to the information transferred between the domains, two unsupervised DA approaches are relevant for still-to-video FR: instance-based and feature representation-based approaches [12]. The former methods attempt to exploit parts of the ED for learning in the OD, while the latter methods exploit the OD to find a common representation space that reduces the difference between the domains and, subsequently, the classification error.

Recently, multi-classifier systems (MCSs) have been shown to provide a high level of accuracy and robustness in watch-list screening applications [6], [8]. In particular, classifier ensembles can increase the accuracy and robustness of still-to-video FR by integrating diverse pools of classifiers generated using multiple representations of reference facial ROIs. Furthermore, during operations, dynamic classifier selection/weighting methods make it possible to exploit the most competent classifiers from the pool for a given input probe [14], [15], [16]. Dynamic selection (DS) has been shown to be an effective tool for addressing ill-defined classification problems, where the training data is limited and imbalanced [17], [18]. To the best of the authors’ knowledge, DS has not previously been exploited in such SSPP problems without using several other target samples to form a validation set.

In this paper, an efficient and robust MCS is proposed for still-to-video FR. Multiple face representations and domain adaptation are exploited to generate an individual-specific ensemble of e-SVMs (Ee-SVM) per target individual using a mixture of facial ROIs captured in the ED (the single labeled high-quality still of the target and cohort captured under controlled conditions) and the OD (i.e., an abundance of unlabeled facial trajectories captured by surveillance cameras during a calibration process). Facial models are adapted to the OD by training the Ee-SVMs using a single labeled target still ROI versus cohort still ROIs, along with unlabeled non-target video ROIs. Several training schemes are considered for DA of the ensembles, according to the utilization of labeled ROIs in the ED and unlabeled ROIs in the OD.
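The core building block above, a single patch-wise e-SVM trained on one labeled target patch versus many non-target patches, can be sketched as follows. This is a minimal illustration using scikit-learn, not the paper's implementation; the descriptor dimension, sample counts, and class weights are assumptions, and the heavy positive weight follows the usual exemplar-SVM recipe.

```python
# Sketch of training one patch-wise exemplar-SVM (e-SVM): a single labeled
# target still patch is opposed to many non-target patches (cohort stills
# plus unlabeled video patches treated as negatives, as in the DA schemes).
# All sizes and weights here are illustrative assumptions.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

d = 128                                        # assumed patch descriptor dimension
x_target = rng.normal(1.0, 1.0, d)             # descriptor of the single target patch
X_nontarget = rng.normal(0.0, 1.0, (500, d))   # cohort + unlabeled video patches

X = np.vstack([x_target, X_nontarget])
y = np.r_[1, np.zeros(len(X_nontarget), dtype=int)]

# Heavily weight the lone positive so it is not swamped by the negatives.
esvm = LinearSVC(C=1.0, class_weight={1: 100.0, 0: 0.01}).fit(X, y)

# The exemplar itself should land on the positive side of the margin.
score = esvm.decision_function(x_target.reshape(1, -1))[0]
```

In an ensemble, one such e-SVM would be trained per patch, descriptor, and feature subspace, yielding the diverse pool described above.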

During enrollment of a target individual, semi-random feature subspaces corresponding to different face patches and descriptors are employed to generate a diverse pool of classifiers that provides robustness against the perturbations frequently observed in real-world surveillance environments. In this paper, two application scenarios are investigated to design individual-specific ensembles. In the first scenario, a validation set is employed together with a global criterion (measuring the significance of each patch on the overall performance) in order to rank and select patches and subspaces. In contrast, a local distance-based criterion is used in the second scenario to rank subspaces without employing a validation set. In particular, ranked feature subspaces are sampled from face patches represented using state-of-the-art face descriptors, instead of being randomly sampled from the entire ROIs. Less accurate classifiers are pruned to store a compact pool and thereby alleviate computational complexity.
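The patch- and descriptor-wise subspace sampling described above can be sketched as follows. The descriptor names, dimensions, patch count, and subspace proportion are illustrative assumptions, not the paper's settings; the point is that subspaces are drawn per (patch, descriptor) pair rather than from the whole ROI.

```python
# Minimal sketch of generating semi-random feature subspaces: for each
# (patch, descriptor) pair, sample several index subsets of the descriptor
# dimensions; each subset defines the input space of one patch-wise e-SVM.
# Descriptor sizes and counts below are assumed for illustration.
import numpy as np

rng = np.random.default_rng(42)

descriptor_dims = {"LBP": 59, "HOG": 128}  # per-patch descriptor sizes (assumed)
n_patches = 4        # non-overlapping face patches
n_subspaces = 3      # random subspaces per (patch, descriptor)
proportion = 0.5     # fraction of dimensions kept in each subspace

subspaces = []
for p in range(n_patches):
    for name, dim in descriptor_dims.items():
        k = int(dim * proportion)
        for _ in range(n_subspaces):
            idx = rng.choice(dim, size=k, replace=False)
            subspaces.append((p, name, np.sort(idx)))

# 4 patches x 2 descriptors x 3 subspaces = 24 candidate classifier inputs
print(len(subspaces))
```

In the two scenarios above, these candidate subspaces would then be ranked (globally with a validation set, or locally with a distance-based criterion) and the weakest ones pruned.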

During operations, a subset of the most competent classifiers is dynamically selected/weighted and combined into an ensemble for each probe using novel distance-based criteria. Internal criteria are defined in the e-SVM feature space that rely on the distances from the input probe to the target still and to the non-target support vectors. In addition, persons appearing in a scene are tracked over multiple frames, where the matching scores of each individual are integrated over a facial trajectory (i.e., a group of ROIs linked to a high-quality track) for robust spatio-temporal FR. The proposed system is efficient, since the criteria used to perform DS and weighting allow combining only a restrained number of the most relevant classifiers within the individual-specific ensembles.
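The operational phase can be sketched as below: a distance-ratio competence measure stands in for the paper's internal criteria (the exact criteria are defined in the e-SVM feature space in Section 4), the top-k classifiers are selected and weighted per probe, and scores are accumulated over a trajectory. All data, the competence formula, and the toy matching score are illustrative assumptions.

```python
# Hedged sketch of dynamic selection/weighting plus trajectory-level fusion.
# Each classifier exposes its target exemplar and non-target support vectors;
# competence is an assumed distance ratio, not the paper's exact criterion.
import numpy as np

rng = np.random.default_rng(1)

def competence(probe, exemplar, neg_svs):
    """In (0, 1): higher when the probe is closer to the target exemplar
    than to the nearest non-target support vector."""
    d_pos = np.linalg.norm(probe - exemplar)
    d_neg = np.linalg.norm(neg_svs - probe, axis=1).min()
    return d_neg / (d_pos + d_neg + 1e-12)

def ensemble_score(probe, classifiers, top_k=3):
    """Dynamically select the top_k most competent classifiers and return
    their competence-weighted mean matching score (toy linear score)."""
    comps = np.array([competence(probe, c["exemplar"], c["neg_svs"])
                      for c in classifiers])
    scores = np.array([float(c["exemplar"] @ probe) for c in classifiers])
    sel = np.argsort(comps)[-top_k:]          # dynamic selection
    w = comps[sel] / comps[sel].sum()         # dynamic weighting
    return float(w @ scores[sel])

# Toy pool of 5 patch-wise classifiers in a 16-d feature space.
classifiers = [{"exemplar": rng.normal(1, 1, 16),
                "neg_svs": rng.normal(0, 1, (20, 16))} for _ in range(5)]

# Spatio-temporal fusion: accumulate ensemble scores over a face trajectory.
trajectory = [rng.normal(1, 1, 16) for _ in range(10)]
track_score = np.mean([ensemble_score(roi, classifiers) for roi in trajectory])
```

Because only `top_k` classifiers are evaluated in the combination step, the per-probe cost stays low even when the stored pool is large, which mirrors the efficiency argument above.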

Videos from the COX-S2V [19] and Chokepoint [20] datasets are employed to evaluate and compare the performance of the proposed system against state-of-the-art methods. These datasets contain a high-quality reference still from the ED and low-quality videos of individuals captured under uncontrolled conditions in different ODs. Experimental results are obtained at the transaction and trajectory levels in the ROC and precision-recall spaces. The results indicate that the proposed system provides state-of-the-art accuracy, yet with significantly lower computational complexity.

This paper is organized as follows. Section 2 provides background on still-to-video FR and its challenges, and on state-of-the-art systems developed to address this SSPP problem. Section 3 presents a review of techniques proposed in the literature for ensemble generation and for dynamic selection and weighting of classifiers. Section 4 presents a detailed description of the proposed system. The experimental methodology and simulation results are presented and interpreted in Sections 5 and 6, respectively.

Section snippets

A generic spatio-temporal system

A spatio-temporal system for still-to-video FR is mainly comprised of the following components: face segmentation (detection), person tracking, face classification, and spatio-temporal fusion. In such a system, each surveillance camera captures individuals appearing in its field of view (FoV). Segmentation is performed in each frame to isolate the facial ROIs; features are then extracted and combined into ROI patterns, and the person tracker is initiated. Input ROI patterns are then

Generation and selection of individual-specific ensembles

Techniques introduced in the literature that are relevant for the generation and selection of individual-specific ensembles are briefly presented, including random subspace methods, classification systems, and dynamic classifier selection and weighting. To overcome the challenges of designing a robust MCS under watch-list screening constraints, different techniques can be applied for ensemble generation. Bagging, boosting, and the random subspace method (RSM) are well-known resampling

Dynamic individual-specific Ee-SVMs through domain adaptation

A novel ensemble learning approach is proposed in this paper to design accurate classification systems for each target individual enrolled in a still-to-video FR system. In particular, to improve robustness to intra-class variations, an individual-specific Ee-SVM models the single reference still ROI for the OD using several diverse e-SVMs based on multiple face representations and domain adaptation. During enrollment, each patch-wise e-SVM is trained for a different patch, descriptor and feature

Experimental methodology

Several aspects of the proposed system are assessed experimentally using real-world video surveillance data. First, different e-SVM training schemes are compared for the individual-specific ensembles. Second, different pool generation scenarios are evaluated in terms of accuracy and time complexity. Finally, the impact of applying DS and DW on performance is analyzed.

Number and size of feature subspaces

The critical parameters of the proposed system need to be defined precisely, with the best values selected using the generic pool. The impact of different numbers and dimensions of feature subspaces is statistically analyzed for each face descriptor extracted from each patch, using a validation set during the design phase. In this analysis, different numbers of subspaces (Nrs) are considered w.r.t. different proportions of feature dimensions (Nd). In this section, experiments were conducted with a

Conclusion

In this paper, a robust MCS is proposed for still-to-video FR, specialized for watch-list screening applications, where individual-specific Ee-SVMs are designed to model a single reference still of target individuals. A novel ensemble-based learning approach is utilized, where multiple random subspaces are generated for different face descriptors extracted from face patches to effectively provide ensemble diversity and address the SSPP constraints. Unlike conventional RSM that completely select

Acknowledgment

This work was supported by the Fonds de Recherche du Québec - Nature et Technologies.

Saman Bashbaghi received the B.Sc. degree in computer engineering and M.Sc. in artificial intelligence from Bu-Ali Sina University, Hamedan, Iran, in 2010 and 2012, respectively. He is currently pursuing the Ph.D. in Laboratoire d’imagerie, de vision et d’intelligence artificielle (LIVIA) at the École de Technologie Supérieure (ETS). His main research interests are pattern recognition, computer vision, adaptive classification systems, video surveillance and deep learning.

References (64)

  • M. Galar et al.

    EUSBoost: enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling

    Pattern Recognit.

    (2013)
  • G. Yu et al.

    Semi-supervised classification based on random subspace dimensionality reduction

    Pattern Recognit.

    (2012)
  • L. Didaci et al.

    A study on the performances of dynamic classifier selection based on local accuracy estimation

    Pattern Recognit.

    (2005)
  • A.H. Ko et al.

    From dynamic classifier selection to dynamic ensemble selection

    Pattern Recognit.

    (2008)
  • M. Galar et al.

    DRCW-OVO: distance-based relative competence weighting combination for one-vs-one strategy in multi-class problems

    Pattern Recognit.

    (2015)
  • O. Deniz et al.

    Face recognition using histograms of oriented gradients

    Pattern Recognit. Lett.

    (2011)
  • M.D. la Torre et al.

    Partially-supervised learning from facial trajectories for face recognition in video surveillance

    Inf. Fusion

    (2015)
  • M.D. la Torre et al.

    Adaptive skew-sensitive ensembles for face recognition in video surveillance

    Pattern Recognit.

    (2015)
  • S. Bashbaghi et al.

    Watch-list screening using ensembles based on multiple face representations

    ICPR

    (2014)
  • R. Chellappa et al.

    Face tracking and recognition in video

  • S. Bashbaghi et al.

    Ensembles of exemplar-SVMs for video face recognition from a single sample per person

    AVSS

    (2015)
  • F. Mokhayeri et al.

    Synthetic face generation under various operational conditions in video surveillance

    ICIP

    (2015)
  • V. Patel et al.

    Visual domain adaptation: A survey of recent advances

    IEEE Signal Process. Mag.

    (2015)
  • S.J. Pan et al.

    A survey on transfer learning

    IEEE Trans. Knowl. Data Eng.

    (2010)
  • S. Shekhar et al.

    Generalized domain-adaptive dictionaries

    CVPR

    (2013)
  • T. Gao et al.

    Active classification based on value of classifier

    Advances in Neural Information Processing Systems 24

    (2011)
  • P. Matikainen et al.

    Classifier ensemble recommendation

    ECCV, Workshops and Demonstrations

    (2012)
  • P. Cavalin et al.

    Dynamic selection approaches for multiple classifier systems

    Neural Comput. Appl.

    (2013)
  • Z. Huang et al.

    Benchmarking still-to-video face recognition via partial and local linear discriminant analysis on COX-S2V dataset

    ACCV

    (2013)
  • Y. Wong et al.

    Patch-based probabilistic image quality assessment for face selection and improved video-based face recognition

    CVPR, Biometrics Workshop

    (2011)
  • X. Chen et al.

    Still-to-video face recognition via weighted scenario oriented discriminant analysis

    IJCB

    (2014)
  • H. Wang et al.

    Still-to-video face recognition in unconstrained environments

    Proc. SPIE, Image Processing: Machine Vision Applications

    (2015)


    Eric Granger obtained a Ph.D. in Electrical Engineering from the École Polytechnique de Montréal in 2001. From 1999 to 2001, he was a Defence Scientist at Defence R&D Canada in Ottawa. Until then, his work was focused primarily on neural networks for fast classification of radar signals in Electronic Surveillance (ES) systems. From 2001 to 2003, he worked in R&D with Mitel Networks Inc. on algorithms and electronic circuits to implement cryptographic functions in Internet Protocol (IP) based communication platforms. In 2004, he joined the ETS, Université du Québec, where he has developed applied research activities in the areas of patterns recognition, computer vision and microelectronics. He is presently Full Professor in System Engineering. Since joining ÉTS, he has been a member of the Laboratoire d’imagerie, de vision et d’intelligence artificielle (LIVIA), and his main research interests are adaptive classification systems, incremental learning, change detection, and multi-classifier systems, with applications in biometrics, video surveillance, and computer and network security.

    Robert Sabourin joined the physics department of the Montreal University in 1977 where he was responsible for the design, experimentation and development of scientific instrumentation for the Mont Mégantic Astronomical Observatory. His main contribution was the design and the implementation of a microprocessor based fine tracking system combined with a low light level CCD detector. In 1983, he joined the staff of the École de Technologie Supérieure, Université du Québec, in Montréal where he cofounded the Dept. of Automated Manufacturing Engineering where he is currently a Full Professor, and teaches Pattern Recognition, Evolutionary Algorithms, Neural Networks and Fuzzy Systems. In 1992, he joined also the Computer Science Department of the Pontifícia Universidade Católica do Paraná (Curitiba, Brazil) where he was, co-responsible for the implementation in 1995 of a master program and in 1998 a PhD program in applied computer science. Since 1996, he is a senior member of the Centre for Pattern Recognition and Machine Intelligence (CENPARMI, Concordia University). Since 2012, he is the Research Chair holder specializing in Adaptive Surveillance Systems in Dynamic Environments. Dr. Sabourin is the author (and coauthor) of more than 400 scientific publications including journals and conference proceeding. He was co-chair of the program committee of CIFED’98 (Conférence Internationale Francophone sur l’Écrit et le Document, Québec, Canada) and IWFHR’04 (9th International Workshop on Frontiers in Handwriting Recognition, Tokyo, Japan). He was nominated as Conference co-chair of ICDAR’07 (9th International Conference on Document Analysis and Recognition) that has been held in Curitiba, Brazil in 2007. His research interests are in the areas of adaptive biometric systems, adaptive surveillance systems in dynamic environments, intelligent watermarking systems, evolutionary computation and biocryptography.

    Guillaume-Alexandre Bilodeau received the B.Sc.A. degree in computer engineering and the Ph.D. degree in electrical engineering from Université Laval, QC, Canada, in 1997 and 2004, respectively. He was appointed as an Assistant Professor with Polytechnique Montréal, QC, Canada, in 2004, where he was an Associate Professor in 2011. Since 2014, he has been a Full Professor with Polytechnique Montréal. His research interests encompass image and video processing, video surveillance, object recognition, content-based image retrieval, and medical applications of computer vision. Dr. Bilodeau is a member of the Province of Québec’s Association of Professional Engineers and REPARTI research network.
