
Information Fusion

Volume 24, July 2015, Pages 54-71

Multisensor video fusion based on higher order singular value decomposition

https://doi.org/10.1016/j.inffus.2014.09.008

Highlights

  • Two aligned videos, one conventional, the other infrared, are used to view a scene.

  • Spatial and temporal features are isolated and fused independently using HOSVD.

  • Identification of different types of features is achieved globally.

  • Video noise is identified and suppressed by using HOSVD.

  • The method has superior fusion capabilities but low computational complexity.

Abstract

With the ongoing development of sensor technologies, more and more kinds of video sensors are being employed in video surveillance systems to improve robustness and monitoring performance. In addition, there is often a strong motivation to simultaneously observe the same scene by more than one kind of sensor. How to sufficiently and effectively utilize the information captured by these different sensors is thus of considerable interest. This can be realized using video fusion, by which multiple aligned videos from different sensors are merged into a composite.

In this paper, a video fusion algorithm is presented based on the 3D Surfacelet Transform (3D-ST) and the higher order singular value decomposition (HOSVD). In the proposed method, input videos are first decomposed into many subbands using the 3D-ST. Then the corresponding subbands from all of the input videos are merged to obtain the subbands of the intended fused video. Finally, the fused video is constructed by performing the inverse 3D-ST on the merged subband coefficients. Typically, the spatial information in the scene backgrounds and the temporal information related to moving objects are mixed together in each subband. In the proposed fusion method, the spatial and temporal information is first separated and then merged using the HOSVD. This is different from the currently published fusion rules (e.g., spatio-temporal energy “maximum” or “matching”), which are usually just simple extensions of static image fusion rules; in these, the spatial and temporal information contained in the input videos is generally treated equally and merged using the same fusion strategy. In addition, we note that the so-called “scene noise” in an input video has been largely ignored by the current literature. We show that this noise can be distinguished from the spatio-temporal objects of interest in the scene and then suppressed using the HOSVD. Clearly, this would be very advantageous for a surveillance system, particularly one dealing with scenes of crowds.

Experimental results demonstrate that the proposed fusion method exhibits a lower computational complexity than some existing published video fusion methods, such as the ones based on the structure tensor and the pulse-coupled neural network (PCNN). When the videos are noisy, a modified version of the proposed method is shown to perform better than specialized methods based on the Bivariate-Laplacian model and the PCNN.

Introduction

Video surveillance plays an important role in modern society and has been widely applied in many fields [1]. Traditionally, the color video sensor has been the only modality employed. It works well under ideal conditions but is inadequate in certain cases, such as scenes with poor lighting, smoke or dust [2]. These issues can often be addressed by simultaneously employing aligned video sensors of multiple modalities to capture the contents of the same scene [2], [3], [4]. Thus, how to sufficiently and efficiently utilize the information captured by these different sensors is of considerable interest. In this paper, we discuss how this can be realized using video fusion, by which multiple aligned videos from different sensors are merged into a composite. The fused video contains more useful information than any of the individual input videos and can be used to better interpret the scene [2], [3], [5].

Fig. 1 illustrates an example of video fusion for scene surveillance. As shown in the first row, the moving persons are quite visually evident in the images taken with an infrared video camera. However, the scene environment (e.g., the building and the road) is virtually invisible. In the second row of images, taken by a conventional video camera, we can hardly observe that there are also two men (or moving targets) in the scene. By fusing the two input videos using the method proposed in this paper, we are able to obtain a new processed video, in which the moving targets from the infrared camera and the background images (or the environment of the scene) from the visible light camera are well integrated. This is achieved without object detection. As indicated in the last row of Fig. 1, the fused video shows that there are two men walking across the scene, one walking in front of the building and the other towards it.

Numerous fusion methods, especially at the signal level, have been proposed in the literature [6], [7], [8] to achieve a result similar to that of Fig. 1. However, most of them are only applicable to static images, even though current surveillance systems are based on video. Fusing dynamic image sequences, i.e., video fusion, is therefore more desirable [9].

Most existing video fusion algorithms are founded on individual frames: the input videos are in fact fused frame by frame, independently [2], [9], [10], and the temporal information they contain has usually been ignored during the fusion process. Thus the spatial information in videos has dominated the literature on video fusion, perhaps more aptly referred to as image fusion. Recently, some algorithms [11], [12], [13] have been proposed based on the non-separable three-dimensional Multi-Scale Transform (3D-MST); examples are the 3D Surfacelet Transform (3D-ST) [14], the 3D Uniform Curvelet Transform [15] and the 3D Shearlet Transform [16]. As opposed to approaches using individual frames, such algorithms simultaneously integrate multiple aligned video frames [11], [13], and we would generally expect them to exhibit superior performance in extracting spatio-temporal information.

An important issue in this regard is how to actually merge the different subband coefficients of the input videos, i.e., the fusion rule. As with image fusion algorithms based on the 2D-MST, the fusion rule is also central to video fusion methods based on the 3D-MST. We note that most of the image fusion rules currently in the literature could also be extended to fuse videos in the spatio-temporal domain; an example is the spatio-temporal energy matching fusion rule in [11]. However, a video obviously contains moving objects as well as stationary ones, and the temporal features generally attract more attention than the spatial ones [17]. Nevertheless, these fusion rules treat spatial and temporal information alike, applying the same fusion strategy to both.
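To make this class of rules concrete, the sketch below applies a generic spatio-temporal energy “choose-max” rule to one pair of corresponding subbands: at each coefficient, the input whose squared coefficients, averaged over a small 3D neighbourhood, are larger is retained. The window size and the exact form of the rules in [11] are assumptions here; the code only illustrates the idea in NumPy/SciPy.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def energy_max_fuse(sub_a, sub_b, window=(3, 3, 3)):
        """Fuse two corresponding 3D subbands with a local-energy choose-max rule."""
        ea = uniform_filter(sub_a ** 2, size=window)   # local spatio-temporal energy, video A
        eb = uniform_filter(sub_b ** 2, size=window)   # local spatio-temporal energy, video B
        return np.where(ea >= eb, sub_a, sub_b)        # keep the locally more energetic coefficients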

We have proposed an alternative approach in [12], where the fusion rule was based on a spatio-temporal structure tensor [18]. In that work, an eigenvalue decomposition was first performed on the spatio-temporal structure tensor matrices. The subbands of the input videos were then partitioned into three types of regions, containing mainly (1) background spatial information, (2) moving objects, or (3) non-salient spatio-temporal information. A different fusion strategy, specifically designed for each type of region, was then applied. Such an approach performs better than some static image fusion rules, but the improvement comes at the cost of a greatly increased computational complexity.

In this paper, we suggest a novel video fusion algorithm based on the 3D-ST and the higher order singular value decomposition (HOSVD) [19], [20], [21]. As shown in Fig. 2, the proposed method consists of three parts. First, it employs the 3D-ST as the MST to decompose the input videos into different subbands. Then the corresponding subbands from each input video are fused and, finally, the fused video is reconstructed by performing the inverse 3D-ST. This approach differs from [12] in that the identification of spatial or temporal information is achieved globally rather than pixel by pixel, which greatly reduces the computational complexity.
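The following sketch mirrors the three-part pipeline of Fig. 2: decompose two aligned videos, merge corresponding subbands, and reconstruct by the inverse transform. Because the 3D-ST is not assumed to be available as a library routine, a separable 3D wavelet transform from PyWavelets stands in for it, and a simple absolute-maximum merge stands in for the HOSVD-based rule developed in Section 3; both substitutions are ours, made only to show the overall structure.

    import numpy as np
    import pywt  # PyWavelets; its separable 3D DWT is used here as a stand-in for the 3D-ST

    def fuse_videos(video_a, video_b, wavelet="db2", level=2):
        """Decompose -> merge subbands -> reconstruct (videos are (frames, height, width) arrays)."""
        ca = pywt.wavedecn(video_a, wavelet, level=level)   # [approx, {details}, ..., {details}]
        cb = pywt.wavedecn(video_b, wavelet, level=level)

        # Placeholder merge: keep the coefficient with the larger magnitude.
        fused = [np.where(np.abs(ca[0]) >= np.abs(cb[0]), ca[0], cb[0])]
        for da, db in zip(ca[1:], cb[1:]):
            fused.append({k: np.where(np.abs(da[k]) >= np.abs(db[k]), da[k], db[k]) for k in da})

        rec = pywt.waverecn(fused, wavelet)                     # inverse transform
        return rec[tuple(slice(0, s) for s in video_a.shape)]   # trim any padding to the input size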

In fact, as one of the more efficient tensor decomposition techniques, the HOSVD has been widely employed in many fields, such as image denoising [22], face recognition [23], and texture analysis [24]. Also, two image fusion methods based on the HOSVD were proposed in [25], [26]. In the former, the authors constructed an image tensor using input image frames and employed the HOSVD to obtain a set of basis images. Then the fused image was determined by optimizing the projective coefficients of these basis images. In the latter, the authors exploited the HOSVD to define a local spatial saliency measure plus weights for each input image, and then obtained a fused image as the weighted average of input images.
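In the same spirit, the weighted-average scheme of [26] reduces to the pattern below: the fused image is a per-pixel convex combination of the inputs, with weights derived from saliency maps. The saliency maps are left to the caller here (e.g., a local variance); the HOSVD-based measure actually used in [26] is not reproduced.

    import numpy as np

    def weighted_average_fuse(img_a, img_b, sal_a, sal_b, eps=1e-12):
        """Per-pixel weighted average of two aligned images, driven by saliency maps."""
        w_a = sal_a / (sal_a + sal_b + eps)   # normalised weight of image A in [0, 1]
        return w_a * img_a + (1.0 - w_a) * img_b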

In contrast to [25], [26], the main contributions of our proposed fusion method are as follows:

  • (1) We employ the HOSVD (in the ST domain) to fuse (aligned) videos rather than static images. We thereby capture the temporal redundancy (correlation) among the different frames in a video, so that the static background scene and the temporally moving targets can easily be separated from each other.

  • (2) The identification of the different kinds of information, including noise, is achieved globally. Because of this, the method is considerably faster than our earlier structure-tensor method, which identifies information locally.

  • (3) By using the HOSVD, any noise can be easily distinguished and suppressed in the spatio-temporal video data using a simple shrinkage function during fusion (a sketch is given after this list).
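The sketch referred to in item (3) is given below: a standard soft-thresholding (shrinkage) of subband coefficients, with a robust MAD-based noise estimate. The threshold k·σ is a generic choice used for illustration and is not the specific rule of Eq. (17).

    import numpy as np

    def estimate_sigma(coeffs):
        """Robust noise estimate from the median absolute deviation (MAD)."""
        return np.median(np.abs(coeffs - np.median(coeffs))) / 0.6745

    def soft_shrink(coeffs, sigma, k=3.0):
        """Soft-threshold coefficients: entries below k*sigma (mostly noise) are driven to zero."""
        t = k * sigma
        return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)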

The rest of the paper is organized as follows. Section 2 presents the basic theory of the HOSVD and its application to video analysis. Section 3 describes the proposed video fusion algorithm in detail. Experimental results and some important conclusions are given in Sections 4 and 5, respectively.

Section snippets

HOSVD and its application to video analysis

Tensors (multi-way arrays) are generalizations of scalars, vectors and matrices to an arbitrary number of indices [27], [28]. In multi-linear algebra, tensors are represented as multi-dimensional arrays or N-mode matrices. The order of a tensor is the number of indices in the array. Thus an Nth-order tensor has N indices and is denoted as $\chi \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. In the following, we will discuss how to apply the tensor and the HOSVD to analyze a video.
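As a concrete reference for these definitions, the minimal NumPy sketch below computes the HOSVD of an arbitrary tensor: the mode-n factor matrix is formed from the left singular vectors of the mode-n unfolding, and the core tensor is obtained by multiplying the tensor by the transposed factors along every mode.

    import numpy as np

    def unfold(tensor, mode):
        """Mode-n unfolding: move axis `mode` to the front and flatten the remaining axes."""
        return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

    def hosvd(tensor):
        """HOSVD: one orthogonal factor U_n per mode plus the core S = tensor x_1 U_1^T ... x_N U_N^T."""
        factors = [np.linalg.svd(unfold(tensor, n), full_matrices=False)[0]
                   for n in range(tensor.ndim)]
        core = tensor
        for n, U in enumerate(factors):
            # Mode-n product with U^T: contract U^T with the tensor along mode n.
            core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, n, 0), axes=1), 0, n)
        return core, factors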

Video fusion based on HOSVD and the Surfacelet Transform

Having described in Section 2 the capability of the HOSVD to separate spatial from temporal features, here we focus on using it for video fusion, as outlined earlier in Fig. 2. In this section, we show that corresponding types of information from two input videos can be merged, using appropriate fusion strategies, to produce a single enhanced video. In addition, unlike in [12], the identification of the different types of information in the videos is achieved globally rather than pixel by pixel.
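A simplified illustration of this separation, using a plain matrix SVD along the temporal mode rather than the full HOSVD of the subband tensor, is given below: the dominant temporal component approximates the static background, and the residual carries the moving objects. The rank-1 choice is an assumption made only for the sketch.

    import numpy as np

    def split_static_moving(video, rank=1):
        """Split a (frames, height, width) video into a low-rank background and a motion residual."""
        T, H, W = video.shape
        X = video.reshape(T, H * W)                         # one frame per row
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        background = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # dominant temporal component
        moving = X - background                             # temporal detail (moving objects)
        return background.reshape(T, H, W), moving.reshape(T, H, W)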

Experiments and analysis

Three sets of experiments were performed, applying ST-HOSVD1 to noise-free videos and ST-HOSVD2 to noisy videos:

  • (1) Three pairs of noise-free input videos were fused using different existing fusion methods from the literature, and the results were compared with those of ST-HOSVD1.

  • (2) The impact of the parameter k2 in Eq. (17) on ST-HOSVD2 was examined.

  • (3) Several pairs of noisy input videos were fused using ST-HOSVD2.

Six pairs of infrared and visible videos were employed in the experiments. For convenience, they are

Conclusions

A novel video fusion algorithm (ST-HOSVD1) combining the 3D-ST and the HOSVD is proposed in this paper. By performing the HOSVD on the video tensor constructed from the 3D-ST subband coefficients, the spatial and temporal details contained in the input videos are isolated from each other and then merged by different fusion rules. We show that the proposed ST-HOSVD1 method outperforms traditional approaches based on the spatio-temporal energy “maximum selection” and “matching” (i.e., ST-Maximum and ST-Matching).

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grant No. 61104212, by the Fundamental Research Funds for the Central Universities under Grant Nos. K5051304001 and NSIY211416, and by the China Scholarship Council under Grant No. 201306965005. Martin D. Levine acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grant Number 228-266, RGPIN 1733-13.

Video Pair 1, Video Pair 2 and Video Pair 3 are kindly provided by

References (39)

  • W.M. Hu et al., A survey on visual surveillance of object motion and behaviors, IEEE Trans. Syst. Man Cybern. – Part C: Appl. Rev. (2004).

  • L. Snidaro et al., Fusing multiple video sensors for surveillance, ACM Trans. Multimedia Comput. Commun. Appl. (2012).

  • J. Shi et al., Change detection in synthetic aperture radar images based on fuzzy active contour models and genetic algorithms, Math. Prob. Eng. (2014).

  • L. Snidaro et al., Quality-based fusion of multiple video sensor for video surveillance, IEEE Trans. Syst. Man Cybern. – Part B: Cybern. (2007).

  • O. Rockinger, Image sequence fusion using a shift-invariant wavelet transform, in: Proceedings of the International...

  • E.P. Bennett et al., Multispectral bilateral video fusion, IEEE Trans. Image Process. (2007).

  • Q. Zhang et al., Multisensor video fusion based on spatial–temporal salience detection, Signal Process. (2013).

  • L. Xu et al., Image sequence fusion and denoising based on 3D shearlet transform, J. Appl. Math. (2014).

  • Y.M. Lu et al., Multidimensional directional filter banks and surfacelets, IEEE Trans. Image Process. (2007).