A generative image fusion approach based on supervised deep convolution network driven by weighted gradient flow

https://doi.org/10.1016/j.imavis.2019.02.011

Abstract

In recent years, convolutional neural networks (CNNs) have been used to generate desired images by exploiting their layered features. However, few studies have focused on integrating features obtained from multiple sources to produce a single high-quality image. In this paper, we propose a generative fusion approach using a supervised CNN framework with analysis and synthesis modules. In this framework, the salient feature maps produced by the analysis module are integrated to generate the output by iteratively back-propagating gradients. Furthermore, a differential fusion strategy based on weighted gradient flow is embedded into the end-to-end fusion procedure. To transfer existing network configurations to the current fusion tasks, the proposed network is fine-tuned from pretrained networks such as VGG16, VGG19 and ResNet50. Experimental results show that the proposed approach outperforms other state-of-the-art schemes in various fusion scenarios, and verify that CNN features are sufficiently adaptable and expressive to be aligned for generating fused images.

Introduction

The aim of image fusion is to obtain a high-quality visual scene with salient spatial details by combining all available imaging sources. A single sensor, with its limited imaging capability, cannot capture comprehensive spatial information [1], [2]. An effective solution is therefore a hybrid system known as multi-source image fusion, which extends visual perception and the capability of subsequent processing. In recent years, image fusion has been widely applied in medical image processing, remote sensing, robot vision, surveillance analysis [3], etc. These fusion schemes mainly address two difficulties: first, the internal imaging models are hard to parameterize to guide the fusion procedure; second, aligned images captured under different space-time configurations are easily degraded by unstable external conditions. Although a great number of fusion approaches have been proposed to mitigate these incompatibilities, the challenge remains of unifying them under a generally adaptable framework that achieves the desired performance across different fusion scenarios.

Over the years, studies of image fusion have mostly been based on a methodology known as synthesis-by-analysis. In this framework, the dominant procedure is to decompose the original images to extract salient spatial features. Based on such decomposition transforms, many region-based fusion approaches have been proposed, for instance image matting (IM) [4], guided filtering (GF) [5] and the dense scale-invariant feature transform (DSIFT) [6]. These approaches analyze the original inputs with fixed receptive fields, which is incompatible with the multiple scales of the objects of interest. To overcome this, multi-scale analysis (MSA) has been widely adopted, implemented with pyramids [7], [8], [9], wavelets [10], [11], [12], curvelets [13], non-subsampled contourlets [14], etc. In these approaches, image reconstruction is performed with predetermined spatial transforms; nevertheless, how to determine reasonable transform parameters has not been addressed. A minimal sketch of the generic MSA fusion pipeline is given below.
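To make the MSA pipeline concrete, the following is a minimal sketch of Laplacian-pyramid fusion with a max-absolute rule for the detail levels; it is a generic illustration under stated assumptions (grayscale, registered, equally sized inputs; OpenCV available), not the transform used by any particular cited scheme.

```python
import cv2
import numpy as np

def laplacian_pyramid(img, levels=4):
    """Build a Laplacian pyramid; the last element is the coarsest Gaussian level."""
    gauss = [img.astype(np.float32)]
    for _ in range(levels):
        gauss.append(cv2.pyrDown(gauss[-1]))
    lap = []
    for i in range(levels):
        up = cv2.pyrUp(gauss[i + 1], dstsize=(gauss[i].shape[1], gauss[i].shape[0]))
        lap.append(gauss[i] - up)
    lap.append(gauss[-1])
    return lap

def fuse_pyramids(pyr_a, pyr_b):
    """Max-absolute rule for detail levels, averaging for the coarsest level."""
    fused = [np.where(np.abs(a) >= np.abs(b), a, b) for a, b in zip(pyr_a[:-1], pyr_b[:-1])]
    fused.append(0.5 * (pyr_a[-1] + pyr_b[-1]))
    return fused

def reconstruct(pyr):
    """Collapse the fused pyramid back into a single image."""
    img = pyr[-1]
    for lap in reversed(pyr[:-1]):
        img = cv2.pyrUp(img, dstsize=(lap.shape[1], lap.shape[0])) + lap
    return np.clip(img, 0, 255).astype(np.uint8)

# usage: a and b are registered grayscale source images of the same size
# fused = reconstruct(fuse_pyramids(laplacian_pyramid(a), laplacian_pyramid(b)))
```

The unresolved question noted above is visible even in this sketch: the number of levels and the per-level fusion rule are fixed by hand rather than determined from the data.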

Instead of such deterministic methodologies, more recent studies focus on learning mechanisms that represent image features with a set of trainable linear kernels. Sparse representation (SR) based approaches [15] use a trainable sparse dictionary for salient analysis and then reconstruct the output from the fused sparse coefficients [16], [17]. Although dictionary training can be enhanced by algorithms such as MOD [18], K-SVD [19] and others, two problems remain open. One is that the regularized dictionary may not be consistent with the original images, so spatial inconsistency is often introduced [15]; the other is that a suitable dictionary size is hard to determine [20]. In Ref. [21], an extended fusion approach based on a low-rank dictionary is proposed to make the sparse representations more robust. Such extended schemes may relieve the first problem, but their dictionaries are still designed empirically; moreover, the sparse transformations are insufficient to represent nonlinear spatial details. A minimal patch-based sketch of this family of methods follows.
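As a concrete illustration of SR-based fusion (a generic patch-wise sketch with an assumed max-L1 activity rule, not the specific algorithm of any cited reference), the following learns a shared dictionary on patches from both sources, sparse-codes each patch with OMP, and keeps, per patch, the coefficient vector with the larger L1 norm; the helper name sr_fuse and all parameter values are illustrative.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode
from sklearn.feature_extraction.image import extract_patches_2d, reconstruct_from_patches_2d

def sr_fuse(img_a, img_b, patch=8, n_atoms=128, k=5):
    """Patch-wise sparse-representation fusion with a max-L1 coefficient rule."""
    pa = extract_patches_2d(img_a.astype(np.float32), (patch, patch)).reshape(-1, patch * patch)
    pb = extract_patches_2d(img_b.astype(np.float32), (patch, patch)).reshape(-1, patch * patch)
    mean_a, mean_b = pa.mean(axis=1, keepdims=True), pb.mean(axis=1, keepdims=True)
    pa, pb = pa - mean_a, pb - mean_b                        # code the detail, keep the mean aside

    dico = MiniBatchDictionaryLearning(n_components=n_atoms, random_state=0)
    D = dico.fit(np.vstack([pa, pb])).components_            # shared over-complete dictionary

    ca = sparse_encode(pa, D, algorithm="omp", n_nonzero_coefs=k)
    cb = sparse_encode(pb, D, algorithm="omp", n_nonzero_coefs=k)
    pick_a = np.abs(ca).sum(axis=1) >= np.abs(cb).sum(axis=1)  # max-L1 activity rule

    fused_codes = np.where(pick_a[:, None], ca, cb)
    fused_means = np.where(pick_a[:, None], mean_a, mean_b)
    fused_patches = (fused_codes @ D + fused_means).reshape(-1, patch, patch)
    return reconstruct_from_patches_2d(fused_patches, img_a.shape)  # average overlapping patches
```

Both open problems mentioned above appear here explicitly: the dictionary size n_atoms is chosen by hand, and nothing constrains the learned atoms to be consistent with the source images.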

With the success of deep learning, convolutional neural networks (CNNs) have shown excellent performance on various vision tasks. The early CNN in Ref. [22] did not receive much attention owing to computational limitations, until deep architectures were shown to converge in practice [23]; after that, various CNNs were proposed, such as AlexNet [24], VGG [25], ResNet [26] and others. These well-known CNNs are usually pretrained on large data sets, so their configurations can be conveniently transferred to current fusion tasks to relieve the limitation of training resources. Recently, deep CNN architectures have been introduced for multi-source image fusion. By formulating a discriminative mask from the layered feature maps of CNNs, deep fusion schemes are reported to outperform shallow models [27], [28], [29]. In Ref. [30], a convolutional fusion approach is proposed that combines the outputs of each convolutional layer into a final decision mask to guide the fusion procedure. Liu et al. propose a fusion approach for multi-focus images using a CNN-based masking algorithm [27]. In their architecture, a single CNN is trained off-line to separate the focused and unfocused regions via layered binary feature maps. Later, Tang et al. propose an extended CNN-masked approach [31] with a more detailed masking scheme, in which three types of masked maps are combined to yield the final decision map. In these models, block effects can be partially relaxed by combining the layered features; however, the final decision map cannot be integrated into a unified fusion framework. This implies that once the mask training is finished, the post-processing is implemented separately, so artifacts introduced earlier are hard to eliminate. Zhao et al. propose a supervised deep convolutional network that extracts high- and low-frequency coefficients from the inputs to determine the final results [32]. The fusion architecture in Ref. [33] is similar to that of Ref. [34], with the pooling layers removed so that the loss of spatial details is reduced; in this case, however, the number of convolution layers has to be reduced owing to spatial scale limits. A minimal sketch of such decision-map fusion follows this paragraph.
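To illustrate the decision-map style of CNN fusion described above, the following is a hypothetical minimal model (the class FocusMaskNet, its layer sizes and the helper fuse_with_mask are assumptions for illustration, not the networks of Refs. [27], [30], [31]): a small fully convolutional classifier predicts a per-pixel focus probability, which is then used as a soft mask to blend the two sources.

```python
import torch
import torch.nn as nn

class FocusMaskNet(nn.Module):
    """Toy fully convolutional network mapping a stacked pair of source images
    (2 input channels) to a per-pixel probability that source A is in focus."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, a, b):
        return self.net(torch.cat([a, b], dim=1))   # (N, 1, H, W) soft decision map

def fuse_with_mask(a, b, model):
    """Blend the two sources with the predicted soft decision map."""
    with torch.no_grad():
        mask = model(a, b)
    return mask * a + (1.0 - mask) * b

# Off-line training, as in the masking approaches above, would minimize e.g.
# nn.functional.binary_cross_entropy(model(a, b), focus_labels) against ground truth.
```

The limitation noted above is structural: the mask prediction and the blending are separate stages, so any artifact baked into the decision map cannot be corrected by a later, unified optimization.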

At present, CNN-based image fusion mainly focuses on a single specified scene. For multi-focus images, Ref. [35] globally enhances the resolution by sharing the feature maps of CNNs; in other fusion tasks, CNN architectures are used to align medical images [28] and infrared and visible images [29]. In these fusion scenes, the discriminative CNNs are introduced as independent analysis procedures. In contrast, deep generative networks are globally optimized by back-propagating the gradient flow across the whole network to minimize a task-specific generation loss, as in super-resolution reconstruction [36], generation of semantic images [37] and others [38]. These generative models implicitly integrate analysis and synthesis in a single forward architecture. Furthermore, the style transfer networks (STNs) proposed in Refs. [39], [40] can generate synthetic scenarios by explicitly regularizing each of the layered outputs; a sketch of such a layered feature-matching loss is given below. Inspired by these developments, in this study we focus on formulating a unified generative network that explicitly generates high-quality fused images.
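For reference, the following is a minimal sketch of the layered feature-matching loss that drives such generative and style-transfer models: a generated image is penalized by the distance between its feature maps at selected VGG16 layers and those of a target image. The layer indices, the weight choice and the helper names (layered_features, feature_matching_loss) are illustrative assumptions; only torchvision's standard VGG16 is relied on.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Frozen VGG16 feature extractor; indices 3, 8, 15 correspond to relu1_2, relu2_2, relu3_3
# in torchvision's layout (an illustrative choice of layers).
_vgg = vgg16(weights="IMAGENET1K_V1").features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)
_LAYERS = {3, 8, 15}

def layered_features(x):
    """Return the feature maps of x at the selected layers."""
    feats = []
    for i, layer in enumerate(_vgg):
        x = layer(x)
        if i in _LAYERS:
            feats.append(x)
    return feats

def feature_matching_loss(generated, target):
    """Sum of MSE distances between the layered feature maps of the two images."""
    return sum(F.mse_loss(fg, ft)
               for fg, ft in zip(layered_features(generated), layered_features(target)))
```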

In this paper, we propose a deep image fusion network (DFN) that makes the fusion procedure more concise and efficient for various fusion scenes. Our main contributions are summarized in the following three aspects.

  • We formulate a dual-deep CNN-based fusion framework that integrates the layered feature maps to directly guide output generation. Compared with other implicit generative models, the salient representations from different sources can be explicitly fused in the analysis procedure.

  • Furthermore, a novel differential fusion strategy based on weighted gradient flow is proposed to drive the network optimization and finally form the fused outputs. This fusion strategy inherently regularizes the layered outputs through a designed differential function, so it can be embedded into both network training and testing (a sketch of the general idea appears after this list). To our knowledge, no other fusion approach unifies supervised deep architectures in this way to generate fusion scenarios.

  • Within the proposed framework, pretrained networks such as VGG16, VGG19 and ResNet50 can be conveniently transferred to the current tasks by simple fine-tuning. In addition, we test and analyze our model on various fusion tasks to verify whether the feature maps of CNNs are adaptable and stable enough to generate high-quality images.
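Since the exact differential function of the proposed strategy is detailed in Section 2, the following is only an illustrative sketch of the general idea described above, not the authors' implementation: a generated image is iteratively updated by the gradients of a loss in which each source's layered feature maps are weighted by a per-location saliency (here, channel-averaged feature activity), so the gradient flow is dominated by the more salient source. The saliency measure, the initialization, the optimizer settings and the helper names are all assumptions; it reuses the layered_features helper sketched earlier.

```python
import torch
import torch.nn.functional as F

def saliency_weights(feats_a, feats_b, eps=1e-8):
    """Per-layer, per-location soft weights from feature activity (an assumed saliency measure)."""
    weights = []
    for fa, fb in zip(feats_a, feats_b):
        sa = fa.abs().mean(dim=1, keepdim=True)   # channel-averaged activity of source A
        sb = fb.abs().mean(dim=1, keepdim=True)
        weights.append(sa / (sa + sb + eps))      # in [0, 1]; 1 means source A dominates
    return weights

def generate_fused(img_a, img_b, steps=200, lr=0.01):
    """Iteratively optimize the fused image so its layered features match the
    saliency-weighted combination of the two sources' features."""
    with torch.no_grad():
        feats_a, feats_b = layered_features(img_a), layered_features(img_b)
        w = saliency_weights(feats_a, feats_b)

    fused = (0.5 * (img_a + img_b)).detach().requires_grad_(True)  # initialize from the average
    opt = torch.optim.Adam([fused], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = sum(F.mse_loss(ff, wi * fa + (1 - wi) * fb)
                   for ff, fa, fb, wi in zip(layered_features(fused), feats_a, feats_b, w))
        loss.backward()     # the weighted gradient flow drives the generation
        opt.step()
    return fused.detach()
```

Because the weighting acts on the gradients of a layered loss rather than on a post-hoc decision map, analysis and synthesis remain inside one end-to-end optimization, which is the property the contribution above emphasizes.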

The rest of this paper is structured as follows. Section 2 presents the details of the proposed approach, including the fusion architecture, the fusion strategy and the training formulas; Section 3 presents the experimental results and evaluations for various fusion tasks, followed by discussions in comparison with other state-of-the-art schemes. Finally, we conclude the paper in the last section.

Section snippets

Layered features in convolution networks

The deep architectures of CNNs have been widely used to obtain salient spatial features from coarse to fine. These layered feature maps, corresponding to the convolution kernels, are applied not only to discriminative tasks but also to tasks based on generative models, which have recently received considerable attention [38], [39], [40]. In generative models, a coarse or noise image serving as the initial input is iteratively optimized until the model generates the desired image. According to the deep

Experimental settings

To evaluate fusion performance in different scenarios, several image sets are tested in the experiments, including medical images, multi-focus images, and infrared and visible images. Both subjective human perception and objective measurements are illustrated and discussed. In Ref. [41], the authors categorize objective measurements into four groups: information theory, image features, image structural similarity and human perception. Similarly, we select one measurement from each group to

Conclusions

The purpose of this study is to generate high-quality images from various sources by exploiting the layered feature maps of CNNs. To this end, a deep image fusion network (DFN) is proposed, built from trainable analysis and synthesis modules on top of fine-tuned VGG16, VGG19 or ResNet50 base architectures, together with a novel differential weighted strategy. In the experiments, the superiority of the proposed network is verified in comparison with other state-of-the-art methods. The fusion results briefly support four

Conflict of interest

There are no conflicts of interest.

Acknowledgments

This research was funded by the National Natural Science Foundation of China (61563025, 61562053), the Yunnan Department of Science and Technology Project (2016FB109) and the Scientific Research Foundation of Yunnan Provincial Department of Education (2017ZZX149).

References (50)

  • Y. Liu et al., A medical image fusion method based on convolutional neural networks
  • H. Chen et al., A human perception inspired quality metric for image fusion based on regional information, Information Fusion, 2007
  • H. Yin et al., A novel sparse-representation-based multi-focus image fusion approach, Neurocomputing, 2016
  • Q. Zhang et al., Multifocus image fusion using the nonsubsampled contourlet transform, Signal Processing, 2009
  • H. Li et al., Performance improvement scheme of multifocus image fusion derived by difference images, Signal Processing, 2016
  • A.M. Eskicioglu et al., Image quality measures and their performance, IEEE Transactions on Communications, 1995
  • T. Stathaki, Image Fusion: Algorithms and Applications, 2008
  • R.P. Broussard et al., Physiologically motivated image fusion for object detection using a pulse coupled neural network, IEEE Transactions on Neural Networks, 1999
  • S. Li et al., Image fusion with guided filtering, IEEE Transactions on Image Processing, 2013
  • W. Wang et al., A multi-focus image fusion method based on Laplacian pyramid, Journal of Computers, 2011
  • M.J. Li et al., Image fusion algorithm based on gradient pyramid and its performance evaluation, Development of Industrial Manufacturing, 2014
  • N. Uniyal et al., Image fusion using morphological pyramid consistency method, International Journal of Computer Applications, 2014
  • B.K.S. Kumar, Multifocus and multispectral image fusion based on pixel significance using discrete cosine harmonic wavelet transform, Signal, Image and Video Processing, 2013
  • S. Li et al., Group-sparse representation with dictionary learning for medical image denoising and fusion, IEEE Transactions on Biomedical Engineering, 2012
  • K. Wang et al., A novel geometric dictionary construction approach for sparse representation based image fusion, Entropy, 2017
This paper has been recommended for acceptance by Sinisa Todorovic.