Unsupervised virtual view synthesis from monocular video with generative adversarial warping☆
Introduction
The rapid development of mobile devices and wireless networks has made interactive 3D graphics applications popular in recent years. Good examples include 3DTV, free-viewpoint video, 3D navigation and virtual environment roaming, as well as some recent attempts from industry, namely Google Stadia, Nvidia GRID and Microsoft Project xCloud. To support the above applications, depth-image-based rendering (DIBR) [1], which is based on virtual view synthesis, is one of the most cost-effective solutions. The major bottlenecks of DIBR are the acquisition and transmission of reference views. To alleviate the problem, synthesizing virtual views from limited reference views is desirable, yet challenging, especially from monocular video. Inferring a novel view from a monocular view is an ill-posed problem, which can be formulated as \(I_v = G(I_r, S)\), where \(I_r\) denotes a reference view and \(S\) indicates supervision information, such as depth, disparity, or appearance flow to \(I_v\).
Existing virtual view synthesis methods can be generally categorized into geometry-based and learning-based approaches. The former ones transform pixels of a reference view to a virtual view with explicit geometry constraints, while the latter ones learn a parametric model of the scene and use it to generate novel views.
Geometry-based methods, including photometric stereo, depth estimation and appearance flow, rely heavily on the explicit geometry of a scene's structure, which may not be available in practice. Unfortunately, estimating scene geometry is a hard problem itself. For example, depth estimation may not work for regions with non-Lambertian reflectance or transparency. Moreover, estimated depth can describe spatial and occlusion relationships only if the camera pose can be obtained. Besides, pixel transformation cannot infer the contents of disoccluded regions in synthesized views, resulting in severe geometric distortions. Post-processing, such as texture synthesis or image inpainting, can alleviate geometric distortions to a certain extent, yet possibly induces new distortions and additional time cost.
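To make concrete why pixel transformation leaves disoccluded holes, the following is a minimal sketch of depth-based forward warping for a rectified horizontal camera shift. The baseline, focal length, and hole marker are illustrative assumptions, not settings from the paper:

```python
import numpy as np

def dibr_warp(reference, depth, baseline=0.05, focal=500.0):
    """Forward-warp a reference view to a horizontally shifted virtual view.

    Under the rectified-stereo assumption, disparity = baseline * focal / depth.
    Pixels compete via a z-buffer (nearer surfaces win); target pixels that
    receive no source pixel remain holes, marked with -1.
    """
    h, w = reference.shape[:2]
    virtual = np.full_like(reference, -1.0)     # -1 marks disoccluded holes
    zbuf = np.full((h, w), np.inf)
    disparity = baseline * focal / np.maximum(depth, 1e-6)
    for y in range(h):
        for x in range(w):
            xv = int(round(x - disparity[y, x]))
            if 0 <= xv < w and depth[y, x] < zbuf[y, xv]:
                zbuf[y, xv] = depth[y, x]
                virtual[y, xv] = reference[y, x]
    return virtual

# tiny example: a near column (small depth) shifts more than the background,
# uncovering a hole where it used to be
ref = np.arange(16.0).reshape(4, 4)
depth = np.full((4, 4), 10.0)
depth[:, 2] = 1.0                      # near object in column 2
out = dibr_warp(ref, depth, baseline=0.02, focal=100.0)
```

In the toy example the near column shifts left by two pixels while the background barely moves, so column 2 of the virtual view is left as a hole — exactly the region that post-processing or, in this paper, a learned generator must fill.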
Learning-based methods do not need explicit geometry information; instead, they estimate implicit geometry by utilizing feature representations based on convolutional neural networks (CNNs). However, training such a generative network requires pristine virtual views as supervision. Taking Deep3D [2] as an example, its pixel-error-based loss function relies on the ground truth of the synthesized image. In practice, such pristine virtual views are rarely available for training, thereby limiting its generalization.
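The supervision requirement can be made concrete: a Deep3D-style pixel loss is only computable when a pristine target view exists. This is a hypothetical sketch of such a loss, not Deep3D's exact objective:

```python
import numpy as np

def supervised_pixel_loss(synthesized, ground_truth):
    """Mean absolute pixel error against a pristine target view.

    This term cannot be evaluated at all when the ground-truth virtual
    view is unavailable -- the situation the unsupervised approach targets.
    """
    return float(np.mean(np.abs(synthesized - ground_truth)))

loss = supervised_pixel_loss(np.zeros((2, 2)), np.ones((2, 2)))
```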
This paper alternatively formulates virtual view synthesis in a generative adversarial way. With depth-image-based rendering as the backbone, we embed a spatiotemporal generative adversarial network (GAN) into virtual view synthesis, together with implicit depth estimation, to produce a novel view. To make the hallucinated result plausible, we design novel perceptual constraints, especially a blind synthesized image quality metric, to optimize the model.
The training depends on neither a geometry prior nor the ground truth of a virtual view. Besides, the whole framework is end-to-end with no extra post-processing. The proposed method has been evaluated on multiple datasets with both subjective and objective evaluations. The main contributions of this paper are summarized as follows:
- A novel virtual view synthesis framework trained in an unsupervised way, which requires no extra geometry information or pristine virtual views.
- A spatiotemporal generative adversarial network embedded into traditional depth-image-based rendering. The proposed framework implicitly combines depth estimation, differentiable 3D warping, hole filling and a spatiotemporal discriminator in an end-to-end way.
- New perceptual constraints that make the novel view well hallucinated, including a pre-trained blind synthesized image quality metric as well as a no-reference structure similarity metric.
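Although the paper's exact loss weighting is not reproduced here, the standard adversarial objective that such a spatiotemporal discriminator optimizes can be sketched as follows. `d_real` and `d_fake` stand for the discriminator's sigmoid scores on real clips and on synthesized clips; the function names are illustrative:

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy over sigmoid outputs."""
    pred = np.clip(pred, 1e-7, 1 - 1e-7)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def discriminator_loss(d_real, d_fake):
    # real frame sequences should score 1, synthesized ones 0
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_adv_loss(d_fake):
    # the generator is rewarded when its synthesized clips fool the discriminator
    return bce(d_fake, np.ones_like(d_fake))
```

The generator's adversarial term shrinks as the discriminator is fooled, which is what removes the need for a pristine virtual view as a regression target.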
The rest of this paper is organized as follows. Related work is reviewed in Section 2. Section 3 introduces the framework. Sections 4 and 5 present the details of the proposed method. In Section 6, the proposed method is evaluated with self-investigation and comparison. Finally, a conclusion is presented in Section 7.
Related work
The work most related to the proposed method is virtual view synthesis, which is reviewed below. Additionally, recent work on generative adversarial networks as well as perceptual constraints is also briefly reviewed.
Overview
Previous virtual view synthesis methods rely on either geometry information or pristine virtual views as supervision. The proposed method alternatively transforms virtual view synthesis into an unsupervised GAN, as illustrated in Fig. 1. Specifically, it uses depth-image-based rendering as the backbone, utilizes a CNN to estimate a plausible depth map for the reference view, and produces a virtual view through an image-inpainting-like generator in an end-to-end manner.
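The data flow of the described pipeline can be sketched as a single forward pass. The module names and interfaces below are illustrative assumptions, with trivial stubs standing in for the learned networks:

```python
import numpy as np

def synthesize_virtual_view(reference, depth_net, warp, inpaint_net):
    """Hypothetical end-to-end forward pass of the pipeline:
    implicit depth -> differentiable 3D warping -> hole-filling generator."""
    depth = depth_net(reference)              # implicit depth, no depth labels used
    warped, holes = warp(reference, depth)    # warped view plus a hole mask
    return inpaint_net(warped, holes)         # generator fills disocclusions

# stub modules just to demonstrate the composition
ref = np.ones((4, 4))
depth_net = lambda img: np.full_like(img, 5.0)
warp = lambda img, d: (img * 0.5, np.zeros_like(img))
inpaint_net = lambda img, mask: img + mask
virtual = synthesize_virtual_view(ref, depth_net, warp, inpaint_net)
```

Because every stage is differentiable, the adversarial and perceptual losses applied to `virtual` can back-propagate through the whole chain, which is what makes joint end-to-end training possible.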
Unsupervised virtual view synthesis network
The technical details of the aforementioned four modules, namely the depth estimation subnet, 3D warping, the hole filling subnet and the spatiotemporal discriminator, are presented below.
Perceptual constraints without pristine image
Since the pristine image of the virtual view is unavailable during training, the proposed method alternatively introduces blind image quality assessment to optimize the generator. In particular, we pre-train a blind synthesized image quality assessment model and attach it to the generator as a loss function. There is considerable research on CNN-based blind image quality assessment; however, few studies address geometric distortion in synthesized images.
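One way such reference-free constraints can be built is sketched below. The gradient-field comparison is a stand-in proxy for the paper's no-reference structure similarity term, and `quality_model` is a hypothetical frozen blind-IQA scorer in [0, 1]; neither is the paper's exact formulation:

```python
import numpy as np

def grad_structure_similarity(a, b, eps=1e-6):
    """Compare the gradient fields of two images, as a proxy for a
    no-reference structure-similarity term between the reference and
    the synthesized view. Returns 1.0 for identical structure."""
    ga = np.stack(np.gradient(a))
    gb = np.stack(np.gradient(b))
    return float(np.mean((2 * ga * gb + eps) / (ga**2 + gb**2 + eps)))

def perceptual_loss(synthesized, reference, quality_model, w=(1.0, 1.0)):
    """Combine a frozen blind-IQA score with the structure term;
    no pristine virtual view is needed anywhere."""
    return (w[0] * (1.0 - quality_model(synthesized))
            + w[1] * (1.0 - grad_structure_similarity(reference, synthesized)))

img = np.random.RandomState(0).rand(8, 8)
dummy_iqa = lambda x: 0.8          # stub for the pre-trained quality model
loss_same = perceptual_loss(img, img, dummy_iqa)
```

The key property is that both terms are computed from the synthesized view itself (plus the reference), so the loss remains well-defined without any ground-truth virtual view.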
Dataset
A total of three typical RGBD datasets are selected for training: NYU Depth v2, KITTI odometry and SceneFlow [5]. NYU Depth v2 provides monocular video and associated depth captured by a depth sensor. We choose 14 scenes from it, with 9 scenes for training and the remaining 5 for testing; the training set contains about 7K frames. KITTI odometry provides monocular video without ground-truth depth. We choose 93 scenes as the training set and the other 46 scenes as the testing set.
Conclusion
In this paper, a novel unsupervised virtual view synthesis approach is proposed. The main idea is to generate a hallucinated virtual view without either geometry information or pristine images of the virtual view during training. To achieve this aim, we first embed a GAN into the DIBR framework. By jointly training the depth estimation subnet, differentiable 3D warping, the hole filling subnet and the spatiotemporal discriminator, we can synthesize virtual views in an end-to-end way. In particular, we design perceptual constraints, including a pre-trained blind synthesized image quality metric, to optimize the model without pristine images.
CRediT authorship contribution statement
Xiaochuan Wang: Conceptualization, Methodology, Software, Experiments. Yi Liu: Writing – original draft. Haisheng Li: Writing – review & editing.
Declaration of Competing Interest
No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.compeleceng.2021.107460.
Acknowledgments
The authors want to thank the anonymous reviewers for their kind suggestions. This work is sponsored by the Beijing Natural Science Foundation (grant number 4202016) and the National Natural Science Foundation of China (grant number 62076012).
Xiaochuan Wang received the Ph.D. degree in computer science and engineering from Beihang University in 2019. He is currently a lecturer with the School of Computer Science and Engineering, Beijing Technology and Business University. His research interests include image processing, computer graphics and virtual reality.
References (38)
- Objective image quality assessment of 3D synthesized views. Signal Process Image Commun (2015)
- View generation with 3D warping using depth information for FTV. Signal Process Image Commun (2009)
- Post-rendering 3D image warping: visibility, reconstruction, and performance for depth-image warping. Tech. rep. (1999)
- Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks
- A novel depth-based virtual view synthesis method for free viewpoint video. IEEE Trans Broadcast (2013)
- Efficient dense stereo with occlusions for new view-synthesis by four-state dynamic programming. Int J Comput Vis (2007)
- Mayer N, Ilg E, Hausser P, Fischer P, Cremers D, Dosovitskiy A, et al. A large dataset to train convolutional networks...
- Efficient depth image based rendering with edge dependent depth filter and interpolation
- Image classification using label constrained sparse coding. Multimedia Tools Appl (2016)
- Image-based rendering and synthesis. IEEE Signal Process Mag (2007)
- View synthesis by appearance flow
- Novel view synthesis from single images via point cloud transformation
- Auto3d: Novel view synthesis through unsupervisely learned variational viewpoint and global 3d representation
- Hole filling method using depth based in-painting for view synthesis in free viewpoint television and 3-d video
- Depth image-based rendering with advanced texture synthesis for 3-D video. IEEE Trans Multimed
- SCCGAN: Style and characters inpainting based on CGAN. Mob Netw Appl
Yi Liu is the general assistant of the Industrial Big Data Division of Beijing Institute of Mechanical Industry Automation Co., Ltd. His research interests include image processing, virtual reality and digital twins.
Haisheng Li received the Ph.D. degree from Beihang University in 2002. He is a professor with the School of Computer Science and Engineering, Beijing Technology and Business University. His research interests include computer graphics, scientific visualization and intelligent information processing.
☆ This paper is for regular issues of CAEE. Reviews processed and approved for publication by the co-Editor-in-Chief Huimin Lu.