Unsupervised virtual view synthesis from monocular video with generative adversarial warping☆
Introduction
The rapid development of mobile devices and wireless networks has made interactive 3D graphics applications popular in recent years. Good examples include 3DTV, free-viewpoint video, 3D navigation and virtual environment roaming, as well as some recent attempts from industry, namely Google Stadia, Nvidia GRID and Microsoft Project xCloud. To support the above applications, depth-image-based rendering (DIBR) [1], which is based on virtual view synthesis, is one of the most cost-effective solutions. The major bottlenecks of DIBR are the acquisition and transmission of reference views. To alleviate the problem, synthesizing virtual views from limited reference views is desirable, yet challenging, especially from monocular video. Inferring a novel view from a monocular view is an ill-posed problem, which can be formulated as \(I_v = G(I_r, S)\), where \(I_r\) denotes a reference view and \(S\) indicates supervision information, such as depth, disparity, or appearance flow to \(I_v\).
Existing virtual view synthesis methods can be generally categorized into geometry-based and learning-based approaches. The former ones transform pixels of a reference view to a virtual view with explicit geometry constraints, while the latter ones learn a parametric model of the scene and use it to generate novel views.
Geometry-based methods, including photometric stereo, depth estimation and appearance flow, rely heavily on the explicit geometry of a scene's structure, which may not be available in practice. Unfortunately, estimating scene geometry is a hard problem itself. For example, depth estimation may not work for regions with non-Lambertian reflectance or transparency. Moreover, estimated depth can describe spatial and occlusion relationships only if the camera pose can be obtained. Besides, pixel transformation cannot infer the contents of disoccluded regions in synthesized views, resulting in severe geometric distortions. Post-processing, such as texture synthesis or image inpainting, can alleviate geometric distortions to a certain extent, yet possibly induces new distortions and additional time cost.
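To make concrete why pixel transformation leaves disoccluded holes, the following is a minimal sketch of depth-based forward warping for a rectified horizontal camera shift. The baseline, focal length, and hole marker are illustrative assumptions, not settings from the paper:

```python
import numpy as np

def dibr_warp(reference, depth, baseline=0.05, focal=500.0):
    """Forward-warp a reference view to a horizontally shifted virtual view.

    Under the rectified-stereo assumption, disparity = baseline * focal / depth.
    Pixels compete via a z-buffer (nearer surfaces win); target pixels that
    receive no source pixel remain holes, marked with -1.
    """
    h, w = reference.shape[:2]
    virtual = np.full_like(reference, -1.0)     # -1 marks disoccluded holes
    zbuf = np.full((h, w), np.inf)
    disparity = baseline * focal / np.maximum(depth, 1e-6)
    for y in range(h):
        for x in range(w):
            xv = int(round(x - disparity[y, x]))
            if 0 <= xv < w and depth[y, x] < zbuf[y, xv]:
                zbuf[y, xv] = depth[y, x]
                virtual[y, xv] = reference[y, x]
    return virtual

# tiny example: a near column (small depth) shifts more than the background,
# uncovering a hole where it used to be
ref = np.arange(16.0).reshape(4, 4)
depth = np.full((4, 4), 10.0)
depth[:, 2] = 1.0                      # near object in column 2
out = dibr_warp(ref, depth, baseline=0.02, focal=100.0)
```

In the toy example the near column shifts left by two pixels while the background barely moves, so column 2 of the virtual view is left as a hole — exactly the region that post-processing or, in this paper, a learned generator must fill.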
Learning-based methods do not need explicit geometry information; instead, they estimate implicit geometry by utilizing feature representations based on convolutional neural networks (CNNs). However, training such a generative network requires pristine virtual views as supervision. Taking Deep3D [2] as an example, its pixel-error-based loss function relies on the ground truth of the synthesized image. In practice, such pristine virtual views are rarely available for training, thereby limiting its generalization.
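The supervision requirement can be made concrete: a Deep3D-style pixel loss is only computable when a pristine target view exists. This is a hypothetical sketch of such a loss, not Deep3D's exact objective:

```python
import numpy as np

def supervised_pixel_loss(synthesized, ground_truth):
    """Mean absolute pixel error against a pristine target view.

    This term cannot be evaluated at all when the ground-truth virtual
    view is unavailable -- the situation the unsupervised approach targets.
    """
    return float(np.mean(np.abs(synthesized - ground_truth)))

loss = supervised_pixel_loss(np.zeros((2, 2)), np.ones((2, 2)))
```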
This paper alternatively formulates virtual view synthesis in a generative adversarial way. With depth-image-based rendering as the backbone, we embed a spatiotemporal generative adversarial network (GAN) into virtual view synthesis, together with implicit depth estimation, to produce a novel view. To make the hallucinated result plausible, we design novel perceptual constraints, especially a blind synthesized image quality metric, to optimize the model.
The training depends on neither a geometry prior nor the ground truth of a virtual view. Besides, the whole framework is end-to-end with no extra post-processing. The proposed method has been evaluated on multiple datasets with both subjective and objective evaluations. The main contributions of this paper are summarized as follows:
- A novel virtual view synthesis framework trained in an unsupervised way, which requires no extra geometry information or pristine virtual views.
- A spatiotemporal generative adversarial network embedded into traditional depth-image-based rendering. The proposed framework implicitly combines depth estimation, differentiable 3D warping, hole filling and a spatiotemporal discriminator in an end-to-end way.
- New perceptual constraints that make the novel view well hallucinated, including a pre-trained blind synthesized image quality metric as well as a no-reference structure similarity metric.
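Although the paper's exact loss weighting is not reproduced here, the standard adversarial objective that such a spatiotemporal discriminator optimizes can be sketched as follows. `d_real` and `d_fake` stand for the discriminator's sigmoid scores on real clips and on synthesized clips; the function names are illustrative:

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy over sigmoid outputs."""
    pred = np.clip(pred, 1e-7, 1 - 1e-7)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def discriminator_loss(d_real, d_fake):
    # real frame sequences should score 1, synthesized ones 0
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_adv_loss(d_fake):
    # the generator is rewarded when its synthesized clips fool the discriminator
    return bce(d_fake, np.ones_like(d_fake))
```

The generator's adversarial term shrinks as the discriminator is fooled, which is what removes the need for a pristine virtual view as a regression target.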
The rest of this paper is organized as follows. Related work is reviewed in Section 2. Section 3 introduces the framework. Sections 4 and 5 present the details of the proposed method. In Section 6, the proposed method is evaluated with self-investigation and comparison. Finally, a conclusion is presented in Section 7.
Related work
The work most related to the proposed method is virtual view synthesis, which is reviewed below. Additionally, recent work on generative adversarial networks as well as perceptual constraints is also briefly reviewed.
Overview
Previous virtual view synthesis methods rely on either geometry information or pristine virtual views as supervision. The proposed method alternatively transforms virtual view synthesis into an unsupervised GAN, as illustrated in Fig. 1. Specifically, it uses depth-image-based rendering as the backbone, utilizes a CNN to estimate a plausible depth map for the reference view, and produces a virtual view through an image-inpainting-like generator in an end-to-end manner.
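The data flow of the described pipeline can be sketched as a single forward pass. The module names and interfaces below are illustrative assumptions, with trivial stubs standing in for the learned networks:

```python
import numpy as np

def synthesize_virtual_view(reference, depth_net, warp, inpaint_net):
    """Hypothetical end-to-end forward pass of the pipeline:
    implicit depth -> differentiable 3D warping -> hole-filling generator."""
    depth = depth_net(reference)              # implicit depth, no depth labels used
    warped, holes = warp(reference, depth)    # warped view plus a hole mask
    return inpaint_net(warped, holes)         # generator fills disocclusions

# stub modules just to demonstrate the composition
ref = np.ones((4, 4))
depth_net = lambda img: np.full_like(img, 5.0)
warp = lambda img, d: (img * 0.5, np.zeros_like(img))
inpaint_net = lambda img, mask: img + mask
virtual = synthesize_virtual_view(ref, depth_net, warp, inpaint_net)
```

Because every stage is differentiable, the adversarial and perceptual losses applied to `virtual` can back-propagate through the whole chain, which is what makes joint end-to-end training possible.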
Unsupervised virtual view synthesis network
The technical details of the aforementioned four modules, namely the depth estimation subnet, 3D warping, the hole filling subnet and the spatiotemporal discriminator, are presented below.
Perceptual constraints without pristine image
Since the pristine image of the virtual view is unavailable during training, the proposed method alternatively introduces blind image quality assessment to optimize the generator. In particular, we pre-train a blind synthesized image quality assessment model and attach it to the generator as a loss function. There is considerable research on CNN-based blind image quality assessment; however, few studies address geometric distortion in synthesized images.
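One way such reference-free constraints can be built is sketched below. The gradient-field comparison is a stand-in proxy for the paper's no-reference structure similarity term, and `quality_model` is a hypothetical frozen blind-IQA scorer in [0, 1]; neither is the paper's exact formulation:

```python
import numpy as np

def grad_structure_similarity(a, b, eps=1e-6):
    """Compare the gradient fields of two images, as a proxy for a
    no-reference structure-similarity term between the reference and
    the synthesized view. Returns 1.0 for identical structure."""
    ga = np.stack(np.gradient(a))
    gb = np.stack(np.gradient(b))
    return float(np.mean((2 * ga * gb + eps) / (ga**2 + gb**2 + eps)))

def perceptual_loss(synthesized, reference, quality_model, w=(1.0, 1.0)):
    """Combine a frozen blind-IQA score with the structure term;
    no pristine virtual view is needed anywhere."""
    return (w[0] * (1.0 - quality_model(synthesized))
            + w[1] * (1.0 - grad_structure_similarity(reference, synthesized)))

img = np.random.RandomState(0).rand(8, 8)
dummy_iqa = lambda x: 0.8          # stub for the pre-trained quality model
loss_same = perceptual_loss(img, img, dummy_iqa)
```

The key property is that both terms are computed from the synthesized view itself (plus the reference), so the loss remains well-defined without any ground-truth virtual view.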
Dataset
A total of three typical RGBD datasets are selected for training: NYU Depth v2, KITTI odometry and SceneFlow [5]. NYU Depth v2 provides monocular video and associated depth captured by a depth sensor. We choose 14 scenes from it, with 9 scenes for training and the remaining 5 for testing; the training set contains about 7K frames. KITTI odometry provides monocular video without ground-truth depth. We choose 93 scenes as the training set and the other 46 scenes as the testing set.
Conclusion
In this paper, a novel unsupervised virtual view synthesis approach is proposed. The main idea is to generate a hallucinated virtual view without either geometry information or pristine images of the virtual view during training. To achieve this aim, we first embed a GAN into the DIBR framework. By jointly training the depth estimation subnet, differentiable 3D warping, the hole filling subnet and the spatiotemporal discriminator, we can synthesize virtual views in an end-to-end way. In particular, we design perceptual constraints, including a pre-trained blind synthesized image quality metric, to optimize the model without pristine images.
CRediT authorship contribution statement
Xiaochuan Wang: Conceptualization, Methodology, Software, Experiments. Yi Liu: Writing – original draft. Haisheng Li: Writing – review & editing.
Declaration of Competing Interest
No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.compeleceng.2021.107460.
Acknowledgments
The authors want to thank the anonymous reviewers for their kind suggestions. This work is sponsored by the Beijing Natural Science Foundation (grant number 4202016) and the National Natural Science Foundation of China (grant number 62076012).
Xiaochuan Wang received the Ph.D. degree in computer science and engineering from Beihang University in 2019. He is currently a lecturer with the School of Computer Science and Engineering, Beijing Technology and Business University. His research interests include image processing, computer graphics and virtual reality.
References (38)
- Objective image quality assessment of 3D synthesized views. Signal Process Image Commun (2015)
- View generation with 3D warping using depth information for FTV. Signal Process Image Commun (2009)
- Post-rendering 3D image warping: visibility, reconstruction, and performance for depth-image warping. Tech. rep. (1999)
- Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks
- A novel depth-based virtual view synthesis method for free viewpoint video. IEEE Trans Broadcast (2013)
- Efficient dense stereo with occlusions for new view-synthesis by four-state dynamic programming. Int J Comput Vis (2007)
- Mayer N, Ilg E, Hausser P, Fischer P, Cremers D, Dosovitskiy A, et al. A large dataset to train convolutional networks...
- Efficient depth image based rendering with edge dependent depth filter and interpolation
- Image classification using label constrained sparse coding. Multimedia Tools Appl (2016)
- Image-based rendering and synthesis. IEEE Signal Process Mag (2007)
- View synthesis by appearance flow
- Novel view synthesis from single images via point cloud transformation
- Auto3d: Novel view synthesis through unsupervisely learned variational viewpoint and global 3d representation
- Hole filling method using depth based in-painting for view synthesis in free viewpoint television and 3-d video
- Depth image-based rendering with advanced texture synthesis for 3-D video. IEEE Trans Multimed
- SCCGAN: Style and characters inpainting based on CGAN. Mob Netw Appl
Yi Liu is the general assistant of the Industrial Big Data Division of Beijing Institute of Mechanical Industry Automation Co., Ltd. His research interests include image processing, virtual reality and digital twins.
Haisheng Li received the Ph.D. degree from Beihang University in 2002. He is a professor with the School of Computer Science and Engineering, Beijing Technology and Business University. His research interests include computer graphics, scientific visualization and intelligent information processing.
☆ This paper is for regular issues of CAEE. Reviews processed and approved for publication by the co-Editor-in-Chief Huimin Lu.