Multisensor video fusion based on higher order singular value decomposition
Introduction
Video surveillance plays an important role in modern society and has been widely applied in many fields [1]. Traditionally, the color video sensor has been the only modality employed. It works well under ideal conditions but is inadequate in certain cases, such as scenes with poor lighting, smoke or dust [2]. These issues can often be addressed by simultaneously employing multiple aligned video sensors of different modalities to capture the same scene [2], [3], [4]. Thus, how to fully and efficiently utilize the information captured by these different sensors is of considerable interest. In this paper, we discuss how this can be realized using video fusion, by which multiple aligned videos from different sensors are merged into a single composite. The fused video contains more useful information than any of the individual input videos and can be used to better interpret the scene [2], [3], [5].
Fig. 1 illustrates an example of video fusion for scene surveillance. As shown in the first row, the moving persons are quite visually evident in the images taken with an infrared video camera. However, the scene environment (e.g., the building and the road) is virtually invisible. In the second row of images, taken by a conventional video camera, we can hardly observe that there are also two men (or moving targets) in the scene. By fusing the two input videos using the method proposed in this paper, we are able to obtain a new processed video, in which the moving targets from the infrared camera and the background images (or the environment of the scene) from the visible light camera are well integrated. This is achieved without object detection. As indicated in the last row of Fig. 1, the fused video shows that there are two men walking across the scene, one walking in front of the building and the other towards it.
Numerous fusion methods, especially at the signal level, have been proposed in the literature [6], [7], [8] to achieve a result similar to that of Fig. 1. However, most of them are applicable only to static images, even though current surveillance systems are video-based; fusing dynamic image sequences, i.e., video fusion, is therefore more desirable [9].
Most existing video fusion algorithms operate on individual frames: the input videos are in fact fused independently, frame by frame [2], [9], [10]. Thus the spatial information in videos has dominated the literature on video fusion, which would perhaps be more aptly called image fusion, while the temporal information in the input videos has usually been ignored during the fusion process. Recently, some algorithms [11], [12], [13] have been proposed based on the non-separable three-dimensional Multi-Scale Transform (3D-MST), examples being the 3D Surfacelet Transform (3D-ST) [14], the 3D Uniform Curvelet Transform [15] and the 3D Shearlet Transform [16]. In contrast to the frame-by-frame approaches, these algorithms simultaneously integrate multiple aligned video frames [11], [13], and we would generally expect them to exhibit superior performance in extracting spatio-temporal information.
An important issue in this regard is how to actually merge the different subband coefficients of the input videos, i.e., the fusion rule. As with image fusion algorithms based on the 2D-MST, the fusion rule is also central to video fusion methods based on the 3D-MST. Most image fusion rules currently in the literature could be extended to fuse videos in the spatio-temporal domain; an example is the spatio-temporal energy matching fusion rule in [11]. However, a video obviously contains moving objects as well as stationary ones, and the temporal features generally attract more attention than the spatial ones [17]. Nevertheless, these fusion rules treat spatial and temporal information alike, applying the same fusion strategies to both.
We have proposed an alternative approach in [12], where the fusion rule was based on a spatio-temporal structure tensor [18]. There, eigenvalue decomposition was first performed on the spatio-temporal structure-tensor matrices. The subbands of the input videos were then partitioned into three types of regions, containing mainly (1) background spatial information, (2) moving objects, or (3) non-salient spatio-temporal information, and a different fusion strategy was specifically designed for each type of region. This approach outperforms some static image fusion rules, but the improvement comes at the cost of greatly increased computational complexity.
In this paper, we propose a novel video fusion algorithm based on the 3D-ST and the higher order singular value decomposition (HOSVD) [19], [20], [21]. As shown in Fig. 2, the proposed method comprises three parts. First, it employs the 3D-ST as the MST to decompose the input videos into different subbands. Corresponding subbands from each input video are then fused and, finally, the fused video is reconstructed by performing the inverse 3D-ST. This approach differs from [12] in that spatial and temporal information is identified globally rather than pixel by pixel, which greatly reduces the computational complexity.
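The three-stage pipeline (decompose, fuse subbands, reconstruct) can be sketched as follows. Since the 3D Surfacelet Transform is not available in standard numerical libraries, a separable box-filter two-band split stands in for the MST here; the function names, the averaging rule for the approximation subband and the max-abs rule for the detail subband are illustrative assumptions, not the paper's exact transform or rules.

```python
import numpy as np

def box3d(v, w=3):
    """Separable 3-D box filter: a stand-in low-pass for the 3D-ST."""
    k = np.ones(w) / w
    for ax in range(3):
        v = np.apply_along_axis(np.convolve, ax, v, k, mode='same')
    return v

def fuse_videos(a, b, w=3):
    """Decompose each video into a two-band split, fuse the subbands,
    and invert the split (mirroring the three stages of Fig. 2)."""
    la, lb = box3d(a, w), box3d(b, w)                  # approximation subbands
    ha, hb = a - la, b - lb                            # detail subbands
    low = 0.5 * (la + lb)                              # average slow structure
    high = np.where(np.abs(ha) >= np.abs(hb), ha, hb)  # max-abs selection
    return low + high                                  # invert the split
```

Because the split is perfectly invertible (low + high reconstructs each input exactly), fusing a video with itself returns the original, which is a useful sanity check for any such pipeline.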
In fact, as one of the more efficient tensor decomposition techniques, the HOSVD has been widely employed in many fields, such as image denoising [22], face recognition [23], and texture analysis [24]. Two image fusion methods based on the HOSVD were also proposed in [25], [26]. In the former, the authors constructed an image tensor from the input image frames and employed the HOSVD to obtain a set of basis images; the fused image was then determined by optimizing the projective coefficients of these basis images. In the latter, the authors exploited the HOSVD to define a local spatial saliency measure and corresponding weights for each input image, and then obtained the fused image as a weighted average of the input images.
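As a rough illustration of the weighted-average idea attributed to [26], the sketch below weights each input by its local patch energy. This is a simplification assumed here for clarity: the actual method derives its saliency measure from the HOSVD, whereas `saliency_weighted_fuse` and the Frobenius-energy weight are our stand-ins.

```python
import numpy as np

def saliency_weighted_fuse(images, patch=4):
    """Fuse same-size grayscale images as a per-patch weighted average,
    weighting each input by its local Frobenius energy (a stand-in for
    the HOSVD-based saliency measure described for [26])."""
    h, w = images[0].shape
    fused = np.zeros((h, w))
    wsum = np.zeros((h, w))
    for img in images:
        for i in range(0, h, patch):
            for j in range(0, w, patch):
                blk = img[i:i + patch, j:j + patch]
                wt = np.linalg.norm(blk) + 1e-12  # avoid division by zero
                fused[i:i + patch, j:j + patch] += wt * blk
                wsum[i:i + patch, j:j + patch] += wt
    return fused / wsum
```

With this weighting, a high-energy (salient) patch dominates the average in its region, while a near-empty patch contributes almost nothing.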
In contrast to [25], [26], the main contributions of our proposed fusion method are as follows:
- (1)
We employ the HOSVD (in the ST domain) to fuse (aligned) videos and not static images. Thus we capture the temporal redundancy (correlation) among different frames in a video so that the static background scene and temporal moving targets can easily be separated from each other.
- (2)
The identification of the different kinds of information, including noise, is achieved globally. Because of this, the method is considerably faster than our earlier reported structure-tensor method, which requires identifying information locally.
- (3)
By using the HOSVD, any noise can be easily distinguished and suppressed in the spatio-temporal video data using a simple shrinkage function during fusion.
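The separation claimed in contributions (1) and (3) can be illustrated with a plain SVD of the temporal unfolding of an aligned video: the dominant singular component captures the static background, the residual carries the moving targets, and small residual coefficients are shrunk to zero as presumed noise. This is a simplified stand-in for the full HOSVD machinery; `separate_background`, the rank-1 background model and the hard threshold `tau` are assumptions for this sketch, not the paper's exact procedure.

```python
import numpy as np

def separate_background(video, tau=0.0):
    """Split an aligned video (frames x rows x cols) into a rank-1
    static background and a motion residual, hard-shrinking small
    residual coefficients (presumed noise) to zero."""
    t, h, w = video.shape
    X = video.reshape(t, h * w)                    # temporal-mode unfolding
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    background = (s[0] * np.outer(U[:, 0], Vt[0])).reshape(t, h, w)
    motion = video - background
    motion = np.where(np.abs(motion) > tau, motion, 0.0)  # shrinkage
    return background, motion
```

For a perfectly static video the temporal unfolding has rank one, so the background term recovers the input exactly and the motion residual vanishes.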
The rest of the paper is organized as follows. Section 2 presents the basic theory of the HOSVD and its application to video analysis. Section 3 describes the proposed video fusion algorithm in detail. Experimental results and conclusions are given in Sections 4 and 5, respectively.
HOSVD and its application to video analysis
Tensors (multi-way arrays) are generalizations of scalars, vectors and matrices to an arbitrary number of indices [27], [28]. In multi-linear algebra, tensors are represented as multi-dimensional arrays or N-mode matrices. The order of a tensor is the number of indices in the array; thus an Nth-order tensor has N indices and is denoted as A ∈ ℝ^(I₁ × I₂ × ⋯ × I_N). In the following, we discuss how to apply tensors and the HOSVD to analyze a video.
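A minimal HOSVD can be written directly from this definition: unfold the tensor along each mode, take the left singular vectors of each unfolding as the mode-n factor matrix, and project the tensor onto these bases to obtain the core tensor. The helper names below (`unfold`, `hosvd`) are ours, not from the paper.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: the mode-`mode` fibers become the columns
    of a (I_mode x prod(other dims)) matrix."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd(tensor):
    """HOSVD: factor matrix U_n from the SVD of each mode-n unfolding,
    core tensor S = A x_1 U_1^H x_2 U_2^H ... x_N U_N^H."""
    factors = [np.linalg.svd(unfold(tensor, m), full_matrices=False)[0]
               for m in range(tensor.ndim)]
    core = tensor
    for m, U in enumerate(factors):
        # mode-m product with U^H: contract mode m against the rows of U,
        # then move the new mode back into position m
        core = np.moveaxis(np.tensordot(core, U.conj(), axes=([m], [0])), -1, m)
    return core, factors
```

Multiplying the core back by each factor matrix (S ×₁ U₁ ⋯ ×_N U_N) recovers the original tensor; truncating the factor matrices instead yields a low multilinear-rank approximation.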
Video fusion based on HOSVD and the Surfacelet Transform
Having described in Section 2 the capability of the HOSVD to separate spatial from temporal features, here we focus on using it for video fusion, as outlined earlier in Fig. 2. In this section, we show that corresponding types of information from two input videos can be merged, using appropriate fusion strategies, to produce a single enhanced video. In addition, in contrast to [12], the identification of the different types of information in the videos is achieved globally rather than pixel by pixel.
Experiments and analysis
Three sets of experiments were performed, on noise-free videos using ST-HOSVD1 and on noisy videos using ST-HOSVD2:
- (1)
Three pairs of noise-free input videos were fused using various existing fusion methods from the literature, and their performance was compared with that of ST-HOSVD1.
- (2)
The impact of the parameter k2 in Eq. (17) on ST-HOSVD2 was examined.
- (3)
Several pairs of noisy input videos were fused using ST-HOSVD2.
Six pairs of infrared and visible videos were employed in the experiments. For convenience, they are
Conclusions
A novel video fusion algorithm (ST-HOSVD1) combining the 3D-ST and the HOSVD is proposed in this paper. By performing the HOSVD on the video tensor constructed using the 3D-ST subband coefficients, the spatial and temporal details contained in the input videos are isolated from each other and then merged by different fusion rules. We show that the proposed ST-HOSVD1 method outperforms traditional approaches based on the spatio-temporal energy “maximum selection” and “matching” (i.e., ST-Maximum
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant No. 61104212, by the Fundamental Research Funds for the Central Universities under Grant Nos. K5051304001 and NSIY211416, and by the China Scholarship Council under Grant No. 201306965005. Martin D. Levine acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grant Number 228-266, RGPIN 1733-13.
Video Pair 1, Video Pair 2 and Video Pair 3 are kindly provided by
References

- et al., Multi-spectral fusion for surveillance systems, Comput. Electric. Eng. (2010)
- et al., Pixel- and region-based image fusion with complex wavelets, Inform. Fus. (2007)
- et al., Performance comparison of different multi-resolution transforms for image fusion, Inform. Fus. (2011)
- et al., Multimodality image fusion by using both phase and magnitude information, Pattern Recogn. Lett. (2013)
- et al., A novel video fusion framework using surfacelet transform, Opt. Commun. (2012)
- et al., A novel face recognition method based on sub-pattern and tensor, Neurocomputing (2011)
- et al., Morphological dilation image coding with context weights prediction, Signal Process.: Image Commun. (2010)
- et al., Similarity-based multimodality image fusion with shiftable complex directional pyramid, Pattern Recogn. Lett. (2011)
- et al., Video fusion performance evaluation based on structural similarity and human visual perception, Signal Process. (2012)
- et al., Non-Gaussian model-based fusion of noisy images in the wavelet domain, Comput. Vis. Image Underst. (2010)
- A survey on visual surveillance of object motion and behaviors, IEEE Trans. Syst. Man Cybern. – Part C: Appl. Rev.
- Fusing multiple video sensors for surveillance, ACM Trans. Multimedia Comput. Commun. Appl.
- Change detection in synthetic aperture radar images based on fuzzy active contour models and genetic algorithms, Math. Prob. Eng.
- Quality-based fusion of multiple video sensor for video surveillance, IEEE Trans. Syst. Man Cybern. – Part B: Cybern.
- Multispectral bilateral video fusion, IEEE Trans. Image Process.
- Multisensor video fusion based on spatial–temporal salience detection, Signal Process.
- Image sequence fusion and denoising based on 3D shearlet transform, J. Appl. Math.
- Multidimensional directional filter banks and surfacelets, IEEE Trans. Image Process.