A generative image fusion approach based on supervised deep convolution network driven by weighted gradient flow☆
Introduction
The aim of image fusion is to obtain a high-quality visual scene with salient spatial details by integrating all available imaging sources. A single sensor, with its limited imaging capability, cannot capture comprehensive spatial information [1], [2]. An effective solution is therefore to develop a hybrid system known as multi-source image fusion, which extends visual perception and the capability of subsequent processing. In recent years, image fusion has been widely applied in medical image processing, remote sensing, robot vision, surveillance analysis [3], etc. These fusion schemes mainly address two difficulties: first, the internal imaging models are hard to parameterize to guide the fusion procedure; second, aligned images captured under different space-time configurations are easily degraded by unstable external conditions. Although many fusion approaches have been proposed to mitigate these incompatibilities, the challenge remains of how to unify these schemes under a generally adaptable framework that achieves the desired performance across different fusion scenarios.
Over the years, studies of image fusion have mostly followed a methodology known as synthesis-by-analysis. In this framework, the dominant procedure is to decompose the original images to obtain salient spatial features. Based on the decomposing transformation, a great many region-based fusion approaches have been proposed, for instance image matting (IM) [4], guided filtering (GF) [5] and the dense scale-invariant feature transform (DSIFT) [6]. These approaches analyze the original inputs with fixed receptive fields, which is ill-suited to objects of interest at multiple scales. To overcome this, multi-scale analysis (MSA) has been widely adopted, implemented with pyramids [[7], [8], [9]], wavelets [[10], [11], [12]], curvelets [13], non-subsampled contourlets [14], etc. In these approaches, the image reconstructions are implemented with predetermined spatial transforms; however, how to determine reasonable transform parameters has not been resolved.
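As an illustration of the multi-scale methodology, the following is a minimal sketch (not any cited method's exact implementation) of Laplacian-pyramid fusion with a max-absolute rule for the detail bands and averaging for the base band; the box-filter blur and nearest-neighbour upsampling are simplifying assumptions:

```python
import numpy as np

def downsample(img):
    """Blur with a 3x3 box filter, then keep every second pixel."""
    pad = np.pad(img, 1, mode="edge")
    blurred = sum(pad[i:i + img.shape[0], j:j + img.shape[1]]
                  for i in range(3) for j in range(3)) / 9.0
    return blurred[::2, ::2]

def upsample(img, shape):
    """Nearest-neighbour upsampling, cropped back to `shape`."""
    up = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)
    return up[:shape[0], :shape[1]]

def laplacian_pyramid(img, levels):
    pyr, cur = [], img.astype(float)
    for _ in range(levels):
        small = downsample(cur)
        pyr.append(cur - upsample(small, cur.shape))  # band-pass residual
        cur = small
    pyr.append(cur)                                   # coarsest approximation
    return pyr

def fuse(img_a, img_b, levels=3):
    pa, pb = laplacian_pyramid(img_a, levels), laplacian_pyramid(img_b, levels)
    fused = [np.where(np.abs(a) >= np.abs(b), a, b)   # details: larger response
             for a, b in zip(pa[:-1], pb[:-1])]
    fused.append(0.5 * (pa[-1] + pb[-1]))             # base band: average
    out = fused[-1]
    for band in reversed(fused[:-1]):                 # collapse the pyramid
        out = upsample(out, band.shape) + band
    return out
```

Because the pyramid is exactly invertible, fusing an image with itself reproduces it, which makes the decomposition easy to sanity-check.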
Instead of this deterministic methodology, more recent studies focus on learning mechanisms that represent image features by a set of trainable linear kernels. Sparse representation (SR) based approaches [15] utilize a trainable sparse dictionary to achieve salient analysis and then reconstruct outputs from the fused sparse coefficients [16], [17]. Although dictionary training can be enhanced by algorithms such as MOD [18], KSVD [19] and others, two problems remain unresolved. First, the regularized dictionary may not be consistent with the original images, so spatial inconsistency is often introduced [15]; second, a suitable dictionary size is hard to determine [20]. In Ref. [21], an expanded fusion approach is proposed based on a low-rank dictionary to make the sparse representations more robust. Such expanded schemes may relieve the first problem, but their dictionaries are still designed empirically; moreover, the sparse transformations are insufficient to represent nonlinear spatial details.
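To make the SR-based pipeline concrete, here is a minimal sketch of patch-wise sparse fusion: each patch vector is coded over a dictionary by greedy orthogonal matching pursuit, and the code with the larger L1 norm (a common activity-level rule) is kept. The random dictionary is purely illustrative; the trained dictionaries of MOD/KSVD [18], [19] would replace it in practice:

```python
import numpy as np

def omp(D, x, n_nonzero):
    """Greedy orthogonal matching pursuit: sparse code of x over dictionary D."""
    residual, support, coef = x.astype(float).copy(), [], np.zeros(0)
    for _ in range(n_nonzero):
        corr = np.abs(D.T @ residual)
        corr[support] = -1.0                      # do not reselect an atom
        support.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    code = np.zeros(D.shape[1])
    code[support] = coef
    return code

def fuse_patches(patches_a, patches_b, D, n_nonzero=4):
    """Fuse corresponding patch vectors with the max-L1 activity rule."""
    fused = []
    for xa, xb in zip(patches_a, patches_b):
        ca, cb = omp(D, xa, n_nonzero), omp(D, xb, n_nonzero)
        # keep the code with the larger L1 norm (higher activity level)
        fused.append(D @ (ca if np.abs(ca).sum() >= np.abs(cb).sum() else cb))
    return fused
```

The sketch also exposes the two problems noted above: the dictionary's consistency with the inputs and its size (number of columns) both have to be fixed by hand.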
With the success of deep learning, convolutional neural networks (CNNs) have been reported to perform excellently on various vision tasks. The early CNN in Ref. [22] received little attention owing to computational limitations, until deep architectures were shown to converge in practice [23]; since then, various CNNs have been proposed, such as AlexNet [24], VGG [25], ResNet [26] and others. The default configurations of these well-known CNNs are often pretrained on large data sets, so they can be conveniently transferred to current fusion tasks to relieve the limitation of training resources. Deep CNN architectures have recently been introduced to fuse multi-source images. By formulating a discriminative mask based on the layered feature maps of CNNs, deep fusion schemes are reported to exceed shallow models [[27], [28], [29]]. In Ref. [30], a convolutional fusion approach is proposed that combines the outputs of each convolutional layer into a final decision mask to guide the fusion procedure. Liu et al. propose a fusion approach for multi-focus images using a CNN-based masking algorithm [27]; in their architecture, a single CNN is trained off-line to discriminate focused from unfocused regions via layered binary feature maps. Tang et al. then propose an expanded CNN-masked approach [31] with a more detailed mask scheme that combines three types of masked maps to obtain the final decision map. In these models, blocking artifacts can be partially alleviated by combining the layered features; however, the final decision map cannot be integrated into a unified fusion framework. This implies that once mask training is finished, the post-processing is implemented separately, so undesirable artifacts introduced earlier are hard to eliminate. Zhao et al. propose a supervised deep convolutional network that extracts high- and low-frequency coefficients from the inputs to determine the final results [32]. The fusion architecture in Ref. [33] is similar to that of Ref. [34], with the pooling layers removed from the network to reduce the loss of spatial details; in this case, however, the number of convolution layers has to be reduced owing to spatial-scale limits.
At present, CNN-based image fusion mainly focuses on a single specified scene. To fuse multi-focus images, Ref. [35] globally enhances resolution by sharing the feature maps of CNNs; in other fusion tasks, CNN architectures are utilized to fuse medical images [28] and infrared and visible images [29]. In these fusion scenes, the discriminative CNNs are introduced as independent analysis procedures. In contrast, deep generative networks are globally optimized by gradient flow back-propagating across the whole network to minimize the generation loss of tasks such as super-resolution reconstruction [36], semantic image generation [37] and others [38]. These generative models implicitly integrate analysis and synthesis in a single forward architecture. Furthermore, the style transfer networks (STNs) proposed in Refs. [39], [40] can generate synthetic scenarios by explicitly regularizing each of the layered outputs. Inspired by these developments, in this study we focus on formulating a unified generative network that explicitly generates high-quality fusion images.
In this paper, we propose a deep image fusion network (DFN) that makes the fusion procedure more concise and efficient for various fusion scenes. Our main contributions are the following three aspects.
- We formulate a dual-deep CNN-based fusion framework that integrates the layered feature maps to directly guide output generation. Compared with implicitly generative models, the salient representations among different sources can be explicitly fused by the analysis procedure.
- We propose a novel differential fusion strategy based on weighted gradient flow, which drives the network optimization and finally forms the fused outputs. This strategy inherently regularizes the layered outputs via a designed differential function, so it can be embedded into both network training and testing. To our knowledge, no other fusion approach unifies supervised deep architectures to generate fusion scenarios.
- Under the proposed framework, pretrained networks such as VGG16, VGG19 and ResNet50 can be conveniently transferred to the current tasks with simple fine-tuning. In addition, we test and analyze our model on various fusion tasks to verify that the feature maps of CNNs are adaptable and stable enough to generate high-quality images.
The rest of this paper is structured as follows. Section 2 presents the details of the proposed approach, including the fusion architecture, the fusion strategy and the training formulas. Section 3 presents the experimental results and evaluations for various fusion tasks, followed by discussions in comparison with other state-of-the-art schemes. Finally, we conclude in the last section.
Section snippets
Layered features in convolution networks
The deep architectures of CNNs have been widely used to obtain salient spatial features from coarse to fine. These layered feature maps, corresponding to the convolution kernels, are applied not only to discriminative tasks but also to tasks based on generative models, which have recently received considerable attention [[38], [39], [40]]. In generative models, a coarse or noise image serving as the initial input is iteratively optimized until the model generates the desired images. According to the deep
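The coarse-to-fine behaviour of layered feature maps can be sketched in a few lines: stacking convolutions with subsampling in between makes each deeper map respond to increasingly coarse structure. The fixed kernels and stride-2 subsampling below are illustrative assumptions, not a specific pretrained network:

```python
import numpy as np

def conv2d(img, kernel):
    """'Valid' 2-D convolution (strictly cross-correlation, as in CNNs)."""
    kh, kw = kernel.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def layered_features(img, kernels):
    """Apply one kernel per layer with stride-2 subsampling in between,
    so deeper maps see the image at increasingly coarse scales."""
    maps, cur = [], img.astype(float)
    for k in kernels:
        cur = np.maximum(conv2d(cur, k), 0.0)  # convolution + ReLU
        maps.append(cur)
        cur = cur[::2, ::2]                    # stride-2 subsampling
    return maps
```

For a 16x16 input and two 3x3 kernels, the two feature maps come out 14x14 and 5x5, which shows how each layer's effective receptive field grows relative to the input.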
Experimental settings
To evaluate fusion performance in different scenarios, several image sets are tested in the experiments, including medical images, multi-focus images, and infrared and visible images. Both subjective human perceptions and objective measurements are presented and discussed. In Ref. [41], the authors categorize objective measurements into four groups: information theory, image feature, image structural similarity and human perception. Following this categorization, we select one measurement from each group to
Conclusions
The purpose of this study is to generate high-quality images from various sources by utilizing the layered feature maps of CNNs. To this end, a deep image fusion network (DFN) is proposed, comprising trainable analysis and synthesis modules built on fine-tuned VGG16, VGG19 or ResNet50 base architectures together with a novel differential weighted strategy. In the experiments, the superiority of the proposed network is verified in comparison with other state-of-the-art methods. The fusion results briefly support four
Conflict of interest
There are no conflicts of interest.
Acknowledgments
This research was funded by the National Natural Science Foundation of China (61563025, 61562053), the Yunnan Department of Science and Technology Project (2016FB109) and the Scientific Research Foundation of Yunnan Provincial Department of Education (2017ZZX149).
References (50)
- et al., Image matting for fusion of multi-focus images in dynamic scenes, Information Fusion (2013)
- et al., Multi-focus image fusion with dense SIFT, Information Fusion (2015)
- et al., A wavelet-based image fusion tutorial, Pattern Recognition (2004)
- et al., Pixel- and region-based image fusion with complex wavelets, Information Fusion (2007)
- et al., Multifocus image fusion by combining curvelet and wavelet transform, Pattern Recognition Letters (2008)
- et al., Infrared and visible image fusion scheme based on NSCT and low-level visual features, Infrared Physics and Technology (2016)
- et al., A general framework for image fusion based on multi-scale transform and sparse representation, Information Fusion (2015)
- et al., Multi-focus image fusion using dictionary-based sparse representation, Information Fusion (2015)
- et al., Joint medical image fusion, denoising and enhancement via discriminative low-rank sparse dictionaries learning, Pattern Recognition (2018)
- et al., Multi-focus image fusion with a deep convolutional neural network, Information Fusion (2017)
- A medical image fusion method based on convolutional neural networks
- A human perception inspired quality metric for image fusion based on regional information, Information Fusion
- A novel sparse-representation-based multi-focus image fusion approach, Neurocomputing
- Multifocus image fusion using the nonsubsampled contourlet transform, Signal Processing
- Performance improvement scheme of multifocus image fusion derived by difference images, Signal Processing
- Image quality measures and their performance, IEEE Transactions on Communications
- Image Fusion: Algorithms and Applications
- Physiologically motivated image fusion for object detection using a pulse coupled neural network, IEEE Transactions on Neural Networks
- Image fusion with guided filtering, IEEE Transactions on Image Processing
- A multi-focus image fusion method based on Laplacian pyramid, Journal of Computers
- Image fusion algorithm based on gradient pyramid and its performance evaluation, Development of Industrial Manufacturing
- Image fusion using morphological pyramid consistency method, International Journal of Computer Applications
- Multifocus and multispectral image fusion based on pixel significance using discrete cosine harmonic wavelet transform, Signal, Image and Video Processing
- Group-sparse representation with dictionary learning for medical image denoising and fusion, IEEE Transactions on Biomedical Engineering
- A novel geometric dictionary construction approach for sparse representation based image fusion, Entropy
☆ This paper has been recommended for acceptance by Sinisa Todorovic.