Neurocomputing, Volume 151, Part 3, 3 March 2015, Pages 1500-1506

Embedding metric learning into set-based face recognition for video surveillance

https://doi.org/10.1016/j.neucom.2014.10.032

Abstract

Face recognition in video surveillance is a challenging task, largely due to the difficulty of matching images across cameras with distinct viewpoints and illuminations. To overcome this difficulty, this paper proposes a novel method which embeds distance metric learning into set-based image matching. First, we use sets of face images, rather than individual images, as the input for recognition, since sets are the more natural form of input in surveillance systems. We model each image set as the convex hull spanned by its member images and measure the dissimilarity of two sets by the distance between the closest points of their convex hulls. Then we propose a set-based distance metric learning scheme to learn a feature-space mapping into a discriminative subspace. Finally, we project image sets into the learned subspace and perform face recognition by comparing the projected sets. In this way, we can adapt to the variation in viewpoints and illuminations across cameras and thereby improve face recognition in video surveillance. Experiments on the public Honda/UCSD and ChokePoint databases demonstrate the superior performance of our method over state-of-the-art approaches.

Introduction

Driven by the strong demand for public security in recent years, intelligent surveillance camera networks have been deployed all over the world, raising many new issues [19], [20]. As one of the key technologies of intelligent surveillance, face recognition in surveillance has attracted growing interest [1], [2], [3], [4], [5].

In video surveillance we often need to compare face images from different cameras. For example, given a video that records a subject walking through a camera's view, we are often required to retrieve the subject from the videos captured by other cameras in the camera network. However, due to the variation in the environments (e.g. viewpoints and illuminations) of the cameras, the appearance of a subject captured by different cameras can differ considerably, making face recognition a challenging task.

On the other hand, many studies have shown that although single-image-based face recognition algorithms can perform well in controlled environments, their performance decreases dramatically in surveillance contexts [6]. This phenomenon has motivated the development of algorithms that make use of the sets of images provided by videos, rather than individual images alone, to compensate for poor viewing conditions [7]. Indeed, recent developments in set-based recognition have shown excellent promise [2], [5].

Although set-based methods are more promising than single-image-based methods, they also face challenges when image sets are captured by different cameras. One major challenge is that, due to the large variation in viewpoints and illuminations, the similarity between image sets of the same subject becomes low, sometimes even lower than the similarity between sets of different subjects, which greatly increases misclassification errors. To overcome this challenge, a natural solution is to learn a mapping which increases the similarity between sets of the same subject from different cameras while reducing the similarity between sets of different subjects. However, most such learning schemes are based on individual images.

In this paper, we propose a novel distance metric learning scheme based on image sets. Our contribution is threefold. First, our scheme learns a feature-space mapping that adapts to the variation in viewpoints and illuminations between cameras; the learning procedure extends the large margin nearest neighbor (LMNN) method [8], which was designed for individual images, to image sets. Secondly, although we adopt the convex hull model of [2] to represent an image set, our scheme differs from the CHISD method of [2] in that we use the learned mapping to project all face sets into a discriminative feature subspace for recognition; compared with the original feature space, this subspace is designed to shorten the distances between sets of the same subject and lengthen the distances between sets of different subjects. Thirdly, we use several real datasets to show that, for video surveillance, face recognition in the learned subspace outperforms recognition in the original feature space.
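The large-margin idea behind LMNN, which this paper extends from individual images to image sets, can be sketched as follows. This is a simplified, illustrative loss for a single anchor point under a linear projection L; the function name and the λ weighting between the "pull" and "push" terms are our assumptions, loosely following the paper's description of balancing target-neighbor and impostor costs:

```python
import numpy as np

def lmnn_style_loss(L, x, target, impostor, margin=1.0, lam=0.5):
    """Simplified LMNN-style loss for one anchor point x.

    Pulls the target neighbor (same subject) close in the projected
    space, and pushes the impostor (different subject) beyond the
    target distance plus a margin, via a hinge penalty.
    """
    d_target = np.sum((L @ (x - target)) ** 2)       # squared projected distance
    d_impostor = np.sum((L @ (x - impostor)) ** 2)
    pull = d_target                                  # attract same-subject pair
    push = max(0.0, margin + d_target - d_impostor)  # hinge: impostor inside margin
    return (1.0 - lam) * pull + lam * push
```

When the impostor lies well outside the margin, the hinge term vanishes and only the attractive term remains, which is exactly the behavior the learned subspace is designed to encourage.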


Related work

There are two main elements in set-based face recognition: (1) the model to represent the image sets, and (2) the distance to measure the similarity between sets. Existing set-based methods have different emphases on these two elements.

Some methods focus on one of these elements. For example, Chen et al. [3] focus on the representation model, utilizing the mean feature of a set to represent the set; Stallkamp et al. [1] focus on the distance measures, manually designing three metrics that weight the

The proposed method

The framework of our proposed method is shown in Fig. 1. We use a convex hull model to represent an image set and use the closest distance between convex hull models to measure the similarity between sets. In the training phase, we learn a set-based discriminative subspace to adapt to the variation across cameras. In the recognition phase, face images are projected into the learned subspace and the final identity of the probe set is established by a nearest neighbor classifier.
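As a concrete illustration of the set model, the closest-point distance between two convex hulls can be computed as a small quadratic program: minimize the distance between one convex combination of each set's images. A minimal sketch using scipy's general-purpose solver (the function name and solver choice are ours, not the paper's, which may use a dedicated QP solver):

```python
import numpy as np
from scipy.optimize import minimize

def convex_hull_distance(X, Y):
    """Distance between the convex hulls of two image sets.

    X (d x m) and Y (d x n) hold one feature vector per column.
    Solves  min_{a,b} ||X a - Y b||^2  with a and b constrained to
    the probability simplex (nonnegative, summing to one).
    """
    m, n = X.shape[1], Y.shape[1]

    def objective(w):
        a, b = w[:m], w[m:]
        r = X @ a - Y @ b            # residual between the two hull points
        return r @ r

    cons = [
        {"type": "eq", "fun": lambda w: np.sum(w[:m]) - 1.0},  # sum(a) = 1
        {"type": "eq", "fun": lambda w: np.sum(w[m:]) - 1.0},  # sum(b) = 1
    ]
    w0 = np.concatenate([np.full(m, 1.0 / m), np.full(n, 1.0 / n)])
    res = minimize(objective, w0, bounds=[(0.0, 1.0)] * (m + n),
                   constraints=cons, method="SLSQP")
    return np.sqrt(max(res.fun, 0.0))
```

For example, for two 2-D sets whose hulls are the segments from (0,0) to (1,0) and from (3,0) to (4,0), the closest-point distance is 2; for overlapping hulls it is 0.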

Experiments

We evaluate our proposed method on two public face databases, Honda/UCSD [15] and ChokePoint [16], and compare it with several state-of-the-art set-based methods, including CHISD [2] as the baseline.

System parameters: There are two parameters to consider: the penalty factor λ in (9) and the initial step-size s for updating M. We set the penalty factor λ to 0.5 to balance the cost of "target neighbor" sets and the cost of "imposter" sets. The initial step-size s needs to be a sufficiently
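Although the update rule is truncated in this snippet, metric-learning iterations of this kind typically take a gradient step on M with step size s and then project M back onto the positive semidefinite cone so that it remains a valid metric. A hedged sketch of one such step (our own, assuming the standard eigenvalue-clipping projection; the paper's exact update may differ):

```python
import numpy as np

def psd_project(M):
    """Project a matrix onto the PSD cone: symmetrize, then clip
    negative eigenvalues to zero and reconstruct."""
    w, V = np.linalg.eigh((M + M.T) / 2.0)
    return V @ np.diag(np.clip(w, 0.0, None)) @ V.T

def metric_step(M, grad, s):
    """One gradient-descent step on the metric M, kept PSD."""
    return psd_project(M - s * grad)
```

In practice the step size s is shrunk (e.g. halved) whenever the objective fails to decrease, which is consistent with the snippet's requirement that the initial s be sufficiently large.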

Conclusions

In this paper we have proposed a method embedding distance metric learning into set-based face recognition for video surveillance. The idea was motivated by the recognition difficulty due to viewpoint and illumination variations across multiple cameras in surveillance networks. Experiments on public databases showed that the proposed method was superior to the state-of-the-art methods in overcoming this difficulty. The method can be applied to surveillance networks with fixed cameras; future

Acknowledgments

The work was partially sponsored by the National Natural Science Foundation of China (Nos. 61132007 and 61271390).


References (20)

  • J. Stallkamp, H.K. Ekenel, R. Stiefelhagen, Video-based face recognition on real-world data, in: ICCV, 2007, pp....
  • H. Cevikalp, B. Triggs, Face recognition based on image sets, in: CVPR, 2010, pp....
  • S. Chen et al., Face recognition from still images to video sequences: a local-feature-based framework, EURASIP J. Image Video Process. (2011)
  • R. Wang, S. Shan, X. Chen, W. Gao, Manifold-manifold distance with application to face recognition based on image set,...
  • Y. Hu, A.S. Mian, R. Owens, Sparse approximated nearest points for image set classification, in: CVPR, 2011, pp....
  • S. Zhou et al., Beyond a single still image: face recognition from multiple still images and videos, Face Process. Adv. Model. Methods (2005)
  • J.R. Barr et al., Face recognition from video: a review, Int. J. Pattern Recognit. Artif. Intell. (2012)
  • K.Q. Weinberger et al., Distance metric learning for large margin nearest neighbor classification, J. Mach. Learn. Res. (2009)
  • O. Yamaguchi, K. Fukui, K.-i. Maeda, Face recognition using temporal image sequence, in: Automatic Face and Gesture...
  • K. Fukui, O. Yamaguchi, Face recognition using multi-viewpoint patterns for robot vision, in: Robotics Research,...


Guijin Wang received the B.S. and Ph.D. degrees in signal and information processing (with honors) from the Department of Electronics Engineering, Tsinghua University, China, in 1998 and 2003, respectively. From 2003 to 2006, he was with Sony Information Technologies Laboratories as a researcher. Since 2006, he has been with the Department of Electronics Engineering at Tsinghua University, China, as an associate professor. He has published over 50 international journal and conference papers and holds several patents. He was a session chair of IEEE CCNC'06. His research interests are focused on wireless multimedia, image and video processing, depth imaging, pose recognition, intelligent surveillance, industry inspection, object detection and tracking, and online learning.

Fei Zheng received the B.S. degree in Information and Electronics Engineering from the Department of Electronic Engineering, Tsinghua University, China, in 2011. He is currently a master's candidate in the Department of Electronic Engineering, Tsinghua University. His research interests are in the area of machine learning and intelligent surveillance.

Chenbo Shi received the B.S. and Ph.D. degrees from the Department of Electronics Engineering, Tsinghua University, China, in 2005 and 2012, respectively. Between 2008 and 2012, he published over 10 international journal and conference papers, and he is a reviewer for several international journals and conferences. He is now a postdoctoral researcher at Tsinghua University. His research interests are focused on image stitching, stereo matching, matting, object detection and tracking, etc.

Jing-Hao Xue received the B.Eng. degree in telecommunication and information systems in 1993 and the Ph.D. degree in signal and information processing in 1998, both from Tsinghua University, the M.Sc. degree in medical imaging and the M.Sc. degree in statistics, both from Katholieke Universiteit Leuven in 2004, and the degree of Ph.D. in statistics from the University of Glasgow in 2008. He has worked in the Department of Statistical Science at University College London as a Lecturer since 2008. His research interests include statistical and machine-learning techniques for pattern recognition, data mining and image processing, in particular supervised, unsupervised and incompletely supervised learning for complex and high-dimensional data.

Chunxiao Liu is a Ph.D. candidate in the Department of Electronics Engineering, Tsinghua University, China. He received his B.S. degree from the Department of Electronics and Information Engineering, Huazhong University of Science & Technology, Wuhan, in 2008. His research interests include human re-identification, tracking, camera network activity analysis, and machine learning.

Li He was born in Jilin, China, in 1986. He received the B.S. degree from the Department of Electronics Engineering, Tsinghua University, Beijing, China, in 2010. He is currently working toward the Ph.D. degree in the Department of Electronics Engineering, Tsinghua University, Beijing, China. His research interests include the applications of machine learning and pattern recognition in human pose/action recognition and tracking.
