Full Length Article
Self-supervised multi-scale pyramid fusion networks for realistic bokeh effect rendering

https://doi.org/10.1016/j.jvcir.2022.103580

Highlights

  • A self-supervised multi-scale pyramid fusion network is proposed for realistic bokeh effect rendering.

  • “Circle of confusion” is taken as task-specific information for network training.

  • The proposed network achieves new state-of-the-art performance on a bokeh benchmark.

Abstract

Images with a visually pleasing bokeh effect are often unattainable for mobile cameras with compact optics and tiny sensors. To reconcile aesthetic requirements on photo quality with the cost of high-end SLR cameras, synthetic bokeh effect rendering has emerged as an attractive machine learning topic for engineering applications on imaging systems. However, most bokeh rendering models either rely heavily on prior knowledge such as scene depth or are topic-irrelevant data-driven networks without task-specific knowledge, which restricts training efficiency and testing accuracy. Since bokeh is closely related to a phenomenon called the “circle of confusion”, in this paper, following the principle of bokeh generation, a novel self-supervised multi-scale pyramid fusion network is proposed for bokeh rendering. During the pyramid fusion process, structure consistencies are employed to emphasize the importance of the respective bokeh components. Task-specific knowledge that mimics the “circle of confusion” phenomenon through disk blur convolutions is utilized as self-supervised information for network training. The proposed network has been evaluated and compared with several state-of-the-art methods on a public large-scale bokeh dataset, the “EBB!” dataset. The experimental results demonstrate that the proposed network has much better processing efficiency and achieves a more realistic bokeh effect with a much smaller parameter size and shorter running time. Related source code and pre-trained models will be available soon at https://github.com/zfw-cv/MPFNet.

Introduction

“Bokeh” is Japanese in origin and refers to a blurry quality. In photography, it is a very recognizable technique that can produce visually pleasing, aesthetic photos, as shown in Fig. 1.

In practice, images with a visually pleasing bokeh effect are usually produced by professional DSLR cameras with a large aperture and long focal length. However, such images are often unattainable for mobile cameras with compact optics and tiny sensors. To reconcile aesthetic requirements on photo quality with the cost of expensive high-end SLR cameras, the bokeh effect has to be simulated computationally. Therefore, synthetic bokeh effect rendering has emerged as an attractive machine learning technology [1], [2] for engineering applications on imaging systems.

Over the past years, many methods for synthetic bokeh effect rendering have relied heavily on prior knowledge such as scene depth. Among these depth-based methods, some estimate scene depth using hardware such as the dual-pixel autofocus system [3] on Google Pixel devices, the dual lens on the iPhone 7+ and the Time-of-Flight (ToF) lens on Huawei P30+ smartphones. However, since such specialized hardware is expensive, it is often not supported on low-end commercial systems. Moreover, for images already captured with monocular cameras, accurate depth information is not available either. Therefore, many other methods employ pre-trained models such as MegaDepth [4] to estimate depth.

To some degree, incorporating prior knowledge to simulate realistic bokeh blur has the potential to improve the visual effect of the final generated image. However, every approach has its pros and cons. The typical limitations of prior-based methods are: (1) depth-sensor hardware is not always available on mobile devices; (2) pre-processing prior information in software is generally time-consuming; (3) when the estimated prior information is inaccurate, unexpected out-of-focus blurriness conversely deteriorates the quality of the final synthetic image.

Having witnessed the impressive success on image-to-image translation tasks, in recent years many researchers have started to treat bokeh simulation as a subtask of image translation [5], [6], [7]. End-to-end multi-scale encoder–decoder architectures are therefore commonly adopted as camera-independent solutions. The nonlinear mappings between low- and high-aperture photos captured with a high-end DSLR camera are directly modeled in a data-driven way. However, although much progress has been achieved, the majority of these models are topic-irrelevant networks. In other words, task-specific knowledge, such as the intrinsic mechanism of bokeh generation, is often neglected rather than fully utilized for more effective solutions.

According to the principle of optical imaging [8], [9], bokeh generation is closely related to a phenomenon called the “circle of confusion” (CoC). The CoC approximately produces disk blurs of different sizes in out-of-focus areas. Therefore, following this task-specific knowledge, we propose a novel self-supervised multi-scale pyramid fusion network for bokeh rendering. In the proposed network, images blurred by disk convolution kernels of different radii provide self-supervised information for bokeh component learning. The final bokeh image is a weighted combination of the learned factorized bokeh components. Therefore, unlike existing heavily prior-dependent algorithms, the proposed network does not have to pre-process time-consuming priors during training and testing. At the same time, in contrast to topic-irrelevant end-to-end networks, the proposed network employs task-specific knowledge as training guidance. With a clearer training objective, it can effectively boost the network’s training efficiency and accuracy.
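The disk blur convolutions described above can be sketched as follows. This is a minimal illustrative implementation, not the paper’s code: a binary disk kernel approximates the circle of confusion at a given radius, and convolving an image with kernels of several radii yields the self-supervised blur targets. The edge-replication padding and the `disk_kernel`/`disk_blur` names are assumptions for illustration.

```python
import numpy as np

def disk_kernel(radius):
    """Binary disk kernel, normalized to sum to 1 (approximates a circle of confusion)."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    k = (x ** 2 + y ** 2 <= radius ** 2).astype(np.float64)
    return k / k.sum()

def disk_blur(img, radius):
    """Convolve a 2-D image with a disk kernel; 'same' output via edge padding."""
    k = disk_kernel(radius)
    h, w = img.shape
    p = np.pad(img.astype(np.float64), radius, mode='edge')
    out = np.zeros((h, w), dtype=np.float64)
    # Direct convolution: the disk kernel is symmetric, so correlation equals convolution.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out += k[dy, dx] * p[dy:dy + h, dx:dx + w]
    return out
```

Blurring one image with kernels of increasing radius (e.g. `[disk_blur(img, r) for r in (1, 2, 4)]`) gives a stack of progressively defocused versions that can serve as supervision for the factorized bokeh components.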

The proposed network has been evaluated and compared with several state-of-the-art methods on a public large-scale bokeh dataset, the “EBB!” dataset [5]. The experimental results demonstrate that the proposed network has much better processing efficiency and achieves a more realistic bokeh effect with a much smaller parameter size and shorter running time.

The contributions of this paper are summarized as follows:

  • An effective multi-scale pyramid fusion network is proposed for realistic bokeh effect rendering. The proposed network achieves new state-of-the-art performance on a large-scale bokeh benchmark dataset with a relatively small parameter size and real-time processing speed.

  • Task-specific knowledge that mimics the “circle of confusion” phenomenon through disk blur convolutions is utilized as self-supervised information for network training. Structure consistencies are employed to emphasize the importance of the respective bokeh components. With a clearer training objective, this effectively boosts training efficiency and accuracy.

In the following sections, related work is summarized in Section 2. Details of the proposed network are described in Section 3. Experimental results and analysis are then presented in Section 4. Finally, conclusions are given in Section 5.

Section snippets

Bokeh and depth-of-field

The bokeh effect is optically known as the “circle of confusion”. According to the principle of optical imaging [8], [9], as illustrated in Fig. 2, for a specific aperture and focal length, only object points on the focus plane (also referred to as the in-focus object) are ideally projected to corresponding points on the image plane. Any other object points in front of or behind this plane are out of focus and form circles of confusion when projected onto the image plane.
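For reference, the size of the circle of confusion grows with an object’s distance from the focus plane. The following is the standard thin-lens relation (a textbook formula, not reproduced from this snippet) for a lens of focal length f and aperture diameter A focused at distance s_f, observing an object at distance s:

```latex
c \;=\; A \cdot \frac{f}{s_f - f} \cdot \frac{\lvert s - s_f \rvert}{s}
```

The diameter c vanishes on the focus plane (s = s_f) and increases as the object moves away from it, which is why out-of-focus regions receive disk blurs of different sizes.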

The “circle of confusion”, such as the C1

Methodology

In this section, we describe the proposed method in detail. The architecture of the proposed network is illustrated in Fig. 3. Multi-scale information fusion on three pyramid levels is considered in this network.

Specifically, the original image is first pyramidally downsampled into three variants of different resolutions, with a downsampling factor of 2. Each variant is encoded by a respective feature extraction module FEi, i={1,2,3}, where i is the pyramid level index. The encoded features are denoted as
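The input pyramid described above can be sketched as follows. This is an illustrative assumption, not the paper’s code: 2×2 average pooling is used for the factor-2 downsampling (the exact operator is not specified in this snippet), and the learned FEi modules that encode each level are omitted.

```python
import numpy as np

def build_pyramid(img, levels=3):
    """Pyramidally downsample a 2-D image by a factor of 2 per level.

    Returns a list [full resolution, 1/2, 1/4, ...] of length `levels`.
    Downsampling here is 2x2 average pooling (an illustrative choice).
    """
    pyramid = [img.astype(np.float64)]
    for _ in range(levels - 1):
        cur = pyramid[-1]
        h, w = cur.shape[0] // 2 * 2, cur.shape[1] // 2 * 2  # crop to even size
        cur = cur[:h, :w]
        pooled = cur.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        pyramid.append(pooled)
    return pyramid
```

In the network, each level of such a pyramid would be fed to its own feature extraction module before the multi-scale fusion.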

Experiment

In this section, we comprehensively describe the experimental settings and results. The proposed network is implemented in PyTorch and trained on a workstation with an NVIDIA GeForce RTX 3090 GPU. Adam [41] is employed as the optimizer, with the initial learning rate set to 0.0001. The batch size is set to 2.

Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM) [42] and Learned Perceptual Image Patch Similarity (LPIPS) [15] are employed as metrics for performance evaluation. The PSNR and
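Of the three metrics, PSNR has the simplest definition and can be sketched directly (SSIM and LPIPS require, respectively, windowed statistics and a pretrained network, so they are omitted here). This is a minimal reference implementation assuming 8-bit images with peak value 255:

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between a reference and a test image."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Higher PSNR indicates a rendered bokeh image closer to the ground-truth DSLR photo; LPIPS, in contrast, is a perceptual distance where lower is better.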

Conclusion

In this paper, an effective multi-scale pyramid fusion network has been proposed for realistic bokeh effect rendering. Structure consistencies are employed as importance weights for pyramid information fusion. Task-specific knowledge that mimics the “circle of confusion” phenomenon through disk blur convolutions is utilized as self-supervised information for network training. The proposed network has been evaluated on a public large-scale bokeh dataset. Compared with state-of-the-art

CRediT authorship contribution statement

Zhifeng Wang: Design of this study, Analysis and interpretation of data, Implementation of the methodology and experiments, Preparation of the manuscript. Aiwen Jiang: Conceptualization and design of this study, Analysis and interpretation of data, Provision of study materials and computing resources, Writing – original draft, Writing – review & editing. Chunjie Zhang: Conceptualization, Writing – review & editing. Hanxi Li: Conceptualization, Writing – review & editing. Bo Liu: Writing – review

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant No. 61966018, the open research fund of the State Key Laboratory for Management and Control of Complex Systems under Grant No. 20220103, and the Beijing Natural Science Foundation under Grant No. JQ20022.

Zhifeng Wang is currently a postgraduate student in the School of Computer and Information Engineering, Jiangxi Normal University. His research interest is bokeh rendering.

References (42)

  • Saikat Dutta, Depth-aware blending of smoothed images for bokeh effect generation, J. Vis. Commun. Image Represent. (2021)
  • Andrey Ignatov et al., AIM 2019 challenge on bokeh effect synthesis: Methods and results
  • Andrey Ignatov et al., AIM 2020 challenge on rendering realistic bokeh
  • Neal Wadhwa et al., Synthetic depth-of-field with a single-camera mobile phone, ACM Trans. Graph. (2018)
  • Zhengqi Li et al., MegaDepth: Learning single-view depth prediction from internet photos
  • Andrey Ignatov et al., Rendering natural camera bokeh effect with deep learning
  • Ming Qian et al., BGGAN: Bokeh-glass generative adversarial network for rendering realistic bokeh
  • Saikat Dutta et al., Stacked deep multi-scale hierarchical network for fast bokeh effect rendering from a single image
  • Robert Kosara, Silvia Miksch, Semantic depth of field, in: Proceedings of the IEEE Symposium on Information...
  • E. Bigler, Depth of field and Scheimpflug’s rule: a minimalist geometrical approach (2002)
  • Benjamin Busam et al., SteReFo: Efficient image refocusing with stereo vision
  • Chenchi Luo et al., Wavelet synthesis net for disparity estimation to synthesize DSLR calibre bokeh effect on smartphones
  • Dongwei Liu et al., Stereo-based bokeh effects for photography, Mach. Vis. Appl. (2016)
  • Yuna Jeong et al., Real-time dynamic bokeh rendering with efficient look-up table sampling, IEEE Trans. Vis. Comput. Graphics (2020)
  • Xiangyu Xu et al., Rendering portraitures from monocular camera and beyond
  • Richard Zhang et al., The unreasonable effectiveness of deep features as a perceptual metric
  • Kuldeep Purohit et al., Depth-guided dense dynamic filtering network for bokeh effect rendering
  • Yingqian Wang et al., Selective light field refocusing for camera arrays using bokeh rendering and superresolution, IEEE Signal Process. Lett. (2019)
  • Angelica Tiemi Mizuno Nakamura et al., An effective combination of loss gradients for multi-task learning applied on instance segmentation and depth estimation, Eng. Appl. Artif. Intell. (2021)
  • Clement Godard et al., Digging into self-supervised monocular depth estimation
  • Yifan Zuo et al., Residual dense network for intensity-guided depth map enhancement, Inform. Sci. (2019)
Aiwen Jiang is a full professor and associate director in the School of Computer and Information Engineering, Jiangxi Normal University. He received his Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, China, in 2010, and his bachelor's degree from Nanjing University of Posts and Telecommunications, China, in 2005. His research interests are computer vision and machine learning.

Chunjie Zhang is a full professor in the School of Computer and Information Technology, Beijing Jiaotong University. He received his Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, China, in 2011. He serves as an associate editor for journals such as Information Sciences and Neurocomputing. His research interests focus on computer vision, machine learning, and multimedia information analysis.

Hanxi Li is an associate professor in the School of Computer and Information Engineering, Jiangxi Normal University. He received his Ph.D. degree from the Australian National University in 2011 and his bachelor's degree from Beihang University, China, in 2004. His research interests are computer vision and virtual reality.

Bo Liu is a tenured associate professor in the Department of Computer Science at Auburn University. He obtained his Ph.D. from the Autonomous Learning Lab at the University of Massachusetts Amherst in 2016. His research areas cover decision-making under uncertainty, human-aided machine learning, symbolic AI, trustworthiness and interpretability in machine learning, and their applications to big data, autonomous driving, and healthcare informatics.

    This paper has been recommended for acceptance by Zicheng Liu.
