On potentials of regularized Wasserstein generative adversarial networks for realistic hallucination of tiny faces
Introduction
High-resolution (HR) face images contain rich semantic information and can ideally be captured non-intrusively for both human perception and machine interpretation. Nonetheless, owing to limited imaging conditions, low-resolution (LR) face images are often unavoidable, which naturally degrades the performance of face verification and recognition systems. Hence, the so-called face hallucination technique, i.e., super-resolution of face images, has been intensively studied since the pioneering work of Baker and Kanade [21], [2]. However, we observe that the top-performing face hallucination methods [1], [3], [4], [26], [39], [40], [41], [42], [43] generally super-resolve face images only up to 4× upscaling factors. When faces are far from the surveillance cameras, few of these hallucination algorithms are applicable to tiny faces because they cannot recover visually semantic high-frequency information [23], [44], [7].
In this paper, tiny faces of size 16 × 16 are of particular interest; they are to be magnified by an 8× upscaling factor. Mathematically, the problem is highly ill-posed, and therefore strong regularization constraints must be imposed on the HR faces. From a Bayesian viewpoint, the problem demands a strongly informative prior on the appearance of HR faces. Specifically, if we can formulate well what a face of size 128 × 128 looks like across various poses and expressions, visually acceptable results can very probably be expected. Nevertheless, pursuing such a face regularization or prior model is almost as difficult as the tiny face hallucination problem itself.
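The Bayesian view above can be made concrete with a standard MAP formulation (a generic sketch; the degradation operator H and noise level σ are illustrative notation, not the paper's own):

```latex
\hat{x} \;=\; \arg\max_{x}\; p(y \mid x)\, p(x)
\;=\; \arg\min_{x}\; \frac{1}{2\sigma^{2}}\,\lVert y - Hx \rVert_{2}^{2} \;-\; \log p(x)
```

Here y is the observed 16 × 16 face, H is a blur-and-downsample operator, and −log p(x) is the face prior term, which this paper proposes to realize implicitly with a regularized WGAN rather than in closed form.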
According to the comprehensive survey on face hallucination [49], learning-based methods are shown to be more effective, among which the subspace-based approaches [1], [3], [26], [39], [40], [41], [42] are particularly popular. However, those methods usually assume that the test LR faces are similar in appearance to the training HR faces in terms of pose, expression, or lighting condition. Otherwise, ghosting artifacts appear in the outputs, underscoring the importance of a powerful facial representation model. Meanwhile, most previous alignment-based hallucination algorithms, e.g., [43], [4], depend heavily on precise facial features or landmark points, which is clearly a chicken-and-egg problem for tiny face hallucination.
In machine learning, deep learning [50] is without doubt one of the most powerful representation learning methods. Several general super-resolution approaches based on convolutional neural networks [6], [31], [32], [33], [34], [35], [36], [37], [38], [19] have been proposed since the original work of Dong et al. [5]. These architectures are, however, not specifically targeted at tiny face hallucination, and naïve retraining of them usually fails to produce visually pleasant and realistic outputs. An example is provided in Fig. 1, showing that the recent general super-resolution network VDSR [6] fails to produce visually acceptable hallucination results from the counterpart tiny face, i.e., Fig. 1(c) and (d). To account for the underlying pose, expression and lighting variations in captured tiny faces, Yu and Porikli [9] design a more advanced model, UR-DGN, using generative adversarial networks (GAN) [7] and a large-scale face dataset. The same authors [45] also incorporate transformative discriminative networks into the pipeline of unaligned tiny face hallucination, so as to remedy the deficient representation ability [14], [17], [18] inherent in generative adversarial networks. Besides, a deep cascaded bi-network [44] is proposed as another candidate solution to unaligned tiny face hallucination, learning the hallucination mapping function and the dense face correspondence field alternately. As claimed in that paper, however, the model bears the risk of different types of failure.
In this paper, we take a step further and propose a new alignment-free method for tiny face hallucination. The critical novelty is inspired by the most recent progress in deep unsupervised learning, i.e., the Wasserstein GAN (WGAN) [14], [18], [53]. Although naïve use of the original WGAN topologies cannot produce acceptable hallucination results, WGAN has demonstrated a fairly strong ability to represent class-specific images, particularly when regularized by gradient constraints [53], [18]. Thus, WGAN is viewed in this paper as an implicit face prior model, in that the face representation and the hallucination mapping function are learned simultaneously from a large face dataset. For convenience of description, the proposed tiny face hallucination method is termed tfh-WGAN. To the best of our knowledge, the present paper is the first work to successfully address tiny face hallucination by exploring the potentials of WGAN.
Similar to UR-DGN [9], the proposed tfh-WGAN consists of two networks: a generator G and a discriminator D, or a critic as it is termed in [14]. Besides a pixel-wise L2 regularization term imposing similarity between the hallucinated face and its ground truth, we find in particular that our advocated autoencoding generator with both residual and skip connections plays a critical role in enabling WGAN to represent the facial contour and semantic content with reasonable precision. With the additional Lipschitz penalty and architectural considerations for the critic D, the proposed tfh-WGAN achieves state-of-the-art tiny face hallucination performance in terms of both visual perception and objective assessment. Experimental results demonstrate that the proposed approach not only achieves realistic hallucination of tiny faces, but also adapts to pose, expression, illumination and occlusion variations to a great degree. Since tfh-WGAN is trained end-to-end, the hallucinated face is produced directly by the generator G in the testing phase. Note that the pipeline of tfh-WGAN contains no module for dense alignment or facial landmark localization, nor any explicit constraint on the appearance of face images. All that is required is that the training face images be approximately aligned at the eye locations, a condition satisfied by most face datasets. Following several previous works [9], [45], [54], the cropped CelebA dataset [20], which includes more than 200 thousand face images of about 10 thousand celebrities, is utilized for tuning and analyzing the new approach; the first 20 thousand faces are selected for training tfh-WGAN and the last 260 faces are used for testing.
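The training objective outlined above can be sketched as follows. This is a minimal PyTorch illustration of a gradient-penalized WGAN critic loss together with a generator loss that adds the pixel-wise L2 term; the penalty weight `lambda_gp`, the balance weight `alpha`, and the critic used in the usage example are assumptions for illustration, not the paper's exact settings or code:

```python
import torch
import torch.nn as nn

def critic_loss(critic, real, fake, lambda_gp=10.0):
    """WGAN critic loss plus a Lipschitz (gradient) penalty in the
    style of [18]; a generic sketch, not the paper's exact objective."""
    w_loss = critic(fake).mean() - critic(real).mean()
    # Penalize the critic's gradient norm on random interpolates
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1.0 - eps) * fake).detach().requires_grad_(True)
    grad, = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)
    slopes = grad.flatten(start_dim=1).norm(2, dim=1)
    return w_loss + lambda_gp * ((slopes - 1.0) ** 2).mean()

def generator_loss(critic, hr_pred, hr_true, alpha=1.0):
    """Adversarial term plus the pixel-wise L2 term imposing similarity to
    the ground-truth HR face; the weight alpha is a hypothetical choice."""
    return -critic(hr_pred).mean() + alpha * nn.functional.mse_loss(hr_pred, hr_true)
```

The critic and generator would be updated alternately with these two losses, exactly as in standard WGAN training; the gradient penalty replaces the weight clipping of the original WGAN.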
Overall, the main contributions and distinctions of this paper are summarized as follows:
- 1.
We introduce tfh-WGAN, a new and successful approach to tiny face hallucination that uses a regularized WGAN as an implicit face prior model.
- 2.
Residual and skip connections are demonstrated to be two critical components in the autoencoder topology of the generator, which enable tfh-WGAN to represent the facial contour and semantic content with reasonable precision.
- 3.
With the additional Lipschitz penalty and architectural considerations for the critic D in tfh-WGAN, the proposed approach achieves state-of-the-art hallucination performance in terms of both visual perception and objective assessment.
- 4.
tfh-WGAN contains no module for dense alignment or facial landmark localization, and imposes no constraint on the appearance of face images, yet it produces more realistic hallucination results regardless of pose, expression, illumination and occlusion variations.
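As a concrete illustration of contribution 2, the following PyTorch sketch shows an autoencoding generator with residual blocks and a concat-based skip connection that maps a 16 × 16 tiny face to a 128 × 128 output. The channel widths, layer counts, and the specific upsampling scheme are illustrative assumptions, not the published topology:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 convolutions with an identity (residual) connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class TinyFaceGenerator(nn.Module):
    """Autoencoding generator sketch: encoder features are concatenated
    into the decoder (skip connection); 16x16 input -> 128x128 output (8x)."""
    def __init__(self, ch=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), ResBlock(ch))   # 16x16
        self.enc2 = nn.Sequential(nn.Conv2d(ch, ch, 4, 2, 1), ResBlock(ch))       # 8x8
        self.dec1 = nn.Sequential(nn.Upsample(scale_factor=2, mode='nearest'),
                                  nn.Conv2d(ch, ch, 3, padding=1),
                                  nn.ReLU(inplace=True))                          # back to 16x16
        # concat-based skip connection: fuse encoder and decoder features
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)
        ups = []
        for _ in range(3):  # three x2 stages: 16 -> 32 -> 64 -> 128
            ups += [nn.Upsample(scale_factor=2, mode='nearest'),
                    nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                    ResBlock(ch)]
        self.up = nn.Sequential(*ups)
        self.out = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)
        d = self.dec1(self.enc2(e1))
        d = self.fuse(torch.cat([d, e1], dim=1))  # skip connection via concat
        return self.out(self.up(d))
```

In this sketch, the residual blocks ease optimization of the deep mapping, while the concatenated encoder features let the decoder reuse low-level facial contour information directly, which is the role the paper attributes to these two components.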
Related work
In this section, we discuss several lines of related work prior to introducing the proposed hallucination method, including general image super-resolution, face hallucination, and recent progress on generative adversarial networks.
The proposed tfh-WGAN approach
UR-DGN [9], the first work to propose the use of GAN [7] for tiny face hallucination, replaces the random noise in (1) with a tiny face, along with an additional L2 regularization for identity fidelity. This paper is further inspired by the advanced representation potentials of recent unsupervised learning methods, proposing, to the best of our knowledge for the first time, a successful WGAN-based tiny face hallucination method. It is also highly expected that this newly proposed approach will produce
Experiments
This section makes both quantitative and qualitative comparisons between the proposed tfh-WGAN and several recent representative methods, including two shallow learning-based schemes, LLE [25] and LcR [68], and five deep learning-based schemes, SRCNN [5], VDSR [6], UR-DGN [9], LCGE [56] and CNN-MNCE [54], where LLE [25], SRCNN [5] and VDSR [6] are oriented to general image super-resolution, and LcR [68], UR-DGN [9], LCGE [56] and CNN-MNCE [54] are designed for face-specific super-resolution. Besides, Bicubic
Conclusion
This paper presents a new approach to hallucinating tiny faces of size 16 × 16 by exploiting the potentials of the recent representation learning method WGAN. To produce visually pleasant hallucinated faces, considerable effort is devoted to more advanced architectures for the proposed tfh-WGAN approach. It is especially found that our advocated autoencoding-based generator with both residual and concat-based skip connections is a critical component for WGAN representing the facial
Conflict of interest
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Acknowledgments
The authors would like to express their gratitude to the anonymous reviewers for their helpful and pertinent comments, which have strengthened the manuscript greatly. The first author also thanks Prof. Zhihui Wei, Prof. Michael Elad, Prof. Yizhong Ma, Dr. Min Wu, and Mr. Yatao Zhang for their kind help in the past years. The study is supported in part by the Natural Science Foundation (NSF) of China (61771250, 61602257, 61502244, 61572503, 61872424, 6193000388), the NSF of
References (73)
- et al., Hallucinating face by position-patch, Pattern Recognit. (2010)
- et al., Hallucinating face by eigentransformation, IEEE Trans. Syst. Man Cybern. C (2005)
- et al., Limits on super-resolution and how to break them, IEEE Trans. Pattern Anal. Mach. Intell. (2002)
- et al., A two-step approach to hallucinating faces: global parametric model and local nonparametric model
- et al., Structured face hallucination
- et al., Learning a deep convolutional network for image super-resolution
- et al., Accurate image super-resolution using very deep convolutional networks
- et al., Generative adversarial networks, Adv. Neural Inf. Process. Syst. (2014)
- et al., Photo-realistic single image super-resolution using a generative adversarial network
- et al., Ultra-resolving face images by discriminative generative networks
- Improved techniques for training GANs, Adv. Neural Inf. Process. Syst.
- Image-to-image translation with conditional adversarial networks
- Deep residual learning for image recognition
- Enhanced deep residual networks for single image super-resolution
- Deep learning face attributes in the wild
- Hallucinating faces
- Learning to estimate scenes from images
- Single-image super-resolution: a benchmark
- Low-complexity single-image super-resolution based on nonnegative neighbor embedding
- Super-resolution through neighbor embedding
- Image super-resolution via sparse representation, IEEE Trans. Image Process.
- On single image scale-up using sparse representation, Int. Conf. Curves Surf.
- Anchored neighborhood regression for fast example-based super-resolution
- A+: adjusted anchored neighborhood regression for fast super-resolution
- Deep network cascade for image super-resolution
- Deeply-recursive convolutional network for image super-resolution
- Perceptual losses for real-time style transfer and super-resolution
- Deep networks for image super-resolution with sparse prior
- Super-resolution with deep convolutional sufficient statistics
- Accelerating the super-resolution convolutional neural network
Wen-Ze Shao received the B.S. degree in Science of Information and Computation in June 2003, and the Ph.D. degree in Pattern Recognition and Intelligent Systems in July 2008, both from Nanjing University of Science and Technology, Nanjing, China. From June 2003 to December 2011, he served as an officer in the People's Liberation Army of China. In January 2012, he joined Nanjing University of Posts and Telecommunications as an Assistant Professor. From May 2014 to June 2015, he also worked as a Postdoc Researcher at Department of Computer Science in Technion-Israel Institute of Technology. He is currently an Associate Professor in Nanjing University of Posts and Telecommunications, working in the fields of computational imaging, computer vision, deep learning, and variational optimization.
Jing-Jing Xu received the B.E. degree in Electrical and Information Engineering in June 2017 from Hubei University of Economics, Hubei, China. In September 2017, she began a Master's program at Nanjing University of Posts and Telecommunications, Nanjing, China. Her current research interests include face hallucination, object detection, deep learning, and generative adversarial modeling.
Long Chen received the B.E. degree in Electrical and Information Engineering in June 2015 from Chengxian College of Southeast University, and the M.S. degree in Electronic and Communication Engineering in April 2019 from Nanjing University of Posts and Telecommunications, Nanjing, China. Currently, he is serving as an Algorithm Engineer in Innovation Resource Center, Suzhou Keda Technology Co., Ltd., working in the fields of edge computing, computer vision, and deep learning.
Qi Ge received the B.S. degree in Science of Information and Computation in 2006 and the M.S. degree in Applied Mathematics in 2009, both from Nanjing University of Information and Engineering, Nanjing, China, and the Ph.D. degree in Pattern Recognition and Intelligent System in 2013 from Nanjing University of Science and Technology, Nanjing, China. Now, she serves as an Assistant Professor at College of Telecommunications and Information Engineering in Nanjing University of Posts and Telecommunications. She is particularly interested in the fields of variational PDE approaches, probabilistic graphical models, sparse representation and their applications to image restoration, segmentation, and so on.
Li-Qian Wang received the B.S. degree in Science of Information and Computation in June 2006, the M.S. degree in Computer Application Technology in 2008, and the Ph.D. degree in Pattern Recognition and Intelligent System in 2015, all from Nanjing University of Science and Technology, Nanjing, China. She is currently an Assistant Professor at College of Telecommunications and Information Engineering in Nanjing University of Posts and Telecommunications. Her research interests include variational partial differential equations with applications in image processing, image restoration, image enhancement, and pattern recognition.
Bing-Kun Bao received the Ph.D. degree in control theory and control application from the University of Science and Technology of China, Hefei, China, in 2009. She is currently a professor with College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, China. Her current research interests include cross-media cross-modal image search, social event detection, image classification and annotation, and sparse/low rank representation. She was the recipient of the 2016 ACM Transactions on Multimedia Computing, Communications and Applications (ACM TOMM) Nicolas D. Georganas Best Paper Award, IEEE Multimedia 2017 Best Paper Award, and the Best Paper Award from ICIMCS’09.
Hai-Bo Li received the B.E. degree in Wireless Engineering in 1985 and the M.S. degree in Communication and Electronic Systems in 1988, both from Nanjing University of Posts and Telecommunications, Nanjing, China, and the Ph.D. degree in Information Theory in 1993 from Linköping University, Sweden. In 1997, he became docent (UK equivalent senior lecturer, US equivalent associate professor) in image coding, and in 1999 he became the youngest lifetime professor of signal processing in Umeå University, Sweden. Now, he is a Professor of innovative media technology in KTH Royal Institute of Technology. His research interest is mainly media signal processing, including image coding, video compression, motion estimation, facial and hand gesture recognition for the next generation of mobile phones, invisible interaction technology, and so on.