
Information Fusion

Volume 24, July 2015, Pages 54-71

Multisensor video fusion based on higher order singular value decomposition

https://doi.org/10.1016/j.inffus.2014.09.008

Highlights

  • Two aligned videos, one conventional, the other infrared, are used to view a scene.

  • Spatial and temporal features are isolated and fused independently using HOSVD.

  • Identification of different types of features is achieved globally.

  • Video noise is identified and suppressed by using HOSVD.

  • The method has superior fusion capabilities but low computational complexity.

Abstract

With the ongoing development of sensor technologies, more and more kinds of video sensors are being employed in video surveillance systems to improve robustness and monitoring performance. In addition, there is often a strong motivation to simultaneously observe the same scene by more than one kind of sensor. How to sufficiently and effectively utilize the information captured by these different sensors is thus of considerable interest. This can be realized using video fusion, by which multiple aligned videos from different sensors are merged into a composite.

In this paper, a video fusion algorithm is presented based on the 3D Surfacelet Transform (3D-ST) and the higher order singular value decomposition (HOSVD). In the proposed method, input videos are first decomposed into many subbands using the 3D-ST. Then the corresponding subbands from all of the input videos are merged to obtain the subbands of the intended fused video. Finally, the fused video is constructed by performing the inverse 3D-ST on the merged subband coefficients. Typically, the spatial information in the scene backgrounds and the temporal information related to moving objects are mixed together in each subband. In the proposed fusion method, the spatial and temporal information is first separated and then merged using the HOSVD. This is different from the currently published fusion rules (e.g., spatio-temporal energy “maximum” or “matching”), which are usually just simple extensions of static image fusion rules; in these, the spatial and temporal information contained in the input videos is generally treated equally and merged using the same fusion strategy. In addition, we note that the so-called “scene noise” in an input video has been largely ignored by the current literature. We show that this noise can be distinguished from the spatio-temporal objects of interest in the scene and then suppressed using the HOSVD. Clearly, this would be very advantageous for a surveillance system, particularly one dealing with scenes of crowds.

Experimental results demonstrate that the proposed fusion method exhibits a lower computational complexity than some existing published video fusion methods, such as the ones based on the structure tensor and the pulse-coupled neural network (PCNN). When the videos are noisy, a modified version of the proposed method is shown to perform better than specialized methods based on the Bivariate-Laplacian model and the PCNN.

Introduction

Video surveillance plays an important role in modern society and has been widely applied in many fields [1]. Traditionally, the color video sensor has been the only modality employed. It works well under ideal conditions but is inadequate in certain cases, such as scenes with poor lighting, smoke or dust [2]. These issues can often be addressed by simultaneously employing aligned video sensors of multiple modalities to capture the contents of the same scene [2], [3], [4]. Thus, how to sufficiently and efficiently utilize the information captured by these different sensors is of considerable interest. In this paper, we discuss how this can be realized using video fusion, by which multiple aligned videos from different sensors are merged into a composite. The fused video contains more useful information than any of the individual input videos and can be used to better interpret the scene [2], [3], [5].

Fig. 1 illustrates an example of video fusion for scene surveillance. As shown in the first row, the moving persons are quite visually evident in the images taken with an infrared video camera. However, the scene environment (e.g., the building and the road) is virtually invisible. In the second row of images, taken by a conventional video camera, we can hardly observe that there are also two men (or moving targets) in the scene. By fusing the two input videos using the method proposed in this paper, we are able to obtain a new processed video, in which the moving targets from the infrared camera and the background images (or the environment of the scene) from the visible light camera are well integrated. This is achieved without object detection. As indicated in the last row of Fig. 1, the fused video shows that there are two men walking across the scene, one walking in front of the building and the other towards it.

Numerous fusion methods, especially at the signal level, have been proposed in the literature [6], [7], [8] to achieve a result similar to that of Fig. 1. However, most of them are only applicable to static images, even though current surveillance systems are based on video. Fusing dynamic image sequences, i.e., video fusion, is therefore more desirable [9].

Most existing video fusion algorithms are founded on individual frames: the input videos are in fact fused frame by frame, independently [2], [9], [10], and the temporal information they contain has usually been ignored during the fusion process. Thus the spatial information in videos has dominated the literature on video fusion, perhaps more aptly referred to as image fusion. Recently, some algorithms [11], [12], [13] have been proposed based on the non-separable three-dimensional Multi-Scale Transform (3D-MST); examples are the 3D Surfacelet Transform (3D-ST) [14], the 3D Uniform Curvelet Transform [15] and the 3D Shearlet Transform [16]. As opposed to approaches using individual frames, such algorithms simultaneously integrate multiple aligned video frames [11], [13], and we would generally expect them to exhibit superior performance in extracting spatio-temporal information.

An important issue in this regard is how to actually merge the different subband coefficients of the input videos, i.e., the fusion rule. As with image fusion algorithms based on the 2D-MST, the fusion rule is also central to video fusion methods based on the 3D-MST. We note that most of the image fusion rules currently in the literature could also be extended to fuse videos in the spatio-temporal domain; an example is the spatio-temporal energy matching fusion rule in [11]. However, a video obviously contains moving objects as well as stationary ones, and the temporal features generally attract more attention than the spatial ones [17]. Nevertheless, these fusion rules treat spatial and temporal information alike, applying the same fusion strategy to both.
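To make this class of rules concrete, the sketch below applies a generic spatio-temporal energy “choose-max” rule to one pair of corresponding subbands: at each coefficient, the input whose squared coefficients, averaged over a small 3D neighbourhood, are larger is retained. The window size and the exact form of the rules in [11] are assumptions here; the code only illustrates the idea in NumPy/SciPy.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def energy_max_fuse(sub_a, sub_b, window=(3, 3, 3)):
        """Fuse two corresponding 3D subbands with a local-energy choose-max rule."""
        ea = uniform_filter(sub_a ** 2, size=window)   # local spatio-temporal energy, video A
        eb = uniform_filter(sub_b ** 2, size=window)   # local spatio-temporal energy, video B
        return np.where(ea >= eb, sub_a, sub_b)        # keep the locally more energetic coefficients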

We have proposed an alternative approach in [12], where the fusion rule was based on a spatio-temporal structure tensor [18]. In that work, an eigenvalue decomposition was first performed on the spatio-temporal structure tensor matrices. The subbands of the input videos were then partitioned into three types of regions, containing mainly (1) background spatial information, (2) moving objects, or (3) non-salient spatio-temporal information. A different fusion strategy, specifically designed for each type of region, was then applied. Such an approach performs better than some static image fusion rules, but the improvement comes at the cost of a greatly increased computational complexity.

In this paper, we suggest a novel video fusion algorithm based on the 3D-ST and the higher order singular value decomposition (HOSVD) [19], [20], [21]. As shown in Fig. 2, the proposed method consists of three parts. First, it employs the 3D-ST as the MST to decompose the input videos into different subbands. Then the corresponding subbands from each input video are fused and, finally, the fused video is reconstructed by performing the inverse 3D-ST. This approach differs from [12] in that the identification of spatial or temporal information is achieved globally rather than pixel by pixel, which greatly reduces the computational complexity.
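The following sketch mirrors the three-part pipeline of Fig. 2: decompose two aligned videos, merge corresponding subbands, and reconstruct by the inverse transform. Because the 3D-ST is not assumed to be available as a library routine, a separable 3D wavelet transform from PyWavelets stands in for it, and a simple absolute-maximum merge stands in for the HOSVD-based rule developed in Section 3; both substitutions are ours, made only to show the overall structure.

    import numpy as np
    import pywt  # PyWavelets; its separable 3D DWT is used here as a stand-in for the 3D-ST

    def fuse_videos(video_a, video_b, wavelet="db2", level=2):
        """Decompose -> merge subbands -> reconstruct (videos are (frames, height, width) arrays)."""
        ca = pywt.wavedecn(video_a, wavelet, level=level)   # [approx, {details}, ..., {details}]
        cb = pywt.wavedecn(video_b, wavelet, level=level)

        # Placeholder merge: keep the coefficient with the larger magnitude.
        fused = [np.where(np.abs(ca[0]) >= np.abs(cb[0]), ca[0], cb[0])]
        for da, db in zip(ca[1:], cb[1:]):
            fused.append({k: np.where(np.abs(da[k]) >= np.abs(db[k]), da[k], db[k]) for k in da})

        rec = pywt.waverecn(fused, wavelet)                     # inverse transform
        return rec[tuple(slice(0, s) for s in video_a.shape)]   # trim any padding to the input size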

In fact, as one of the more efficient tensor decomposition techniques, the HOSVD has been widely employed in many fields, such as image denoising [22], face recognition [23], and texture analysis [24]. Also, two image fusion methods based on the HOSVD were proposed in [25], [26]. In the former, the authors constructed an image tensor using input image frames and employed the HOSVD to obtain a set of basis images. Then the fused image was determined by optimizing the projective coefficients of these basis images. In the latter, the authors exploited the HOSVD to define a local spatial saliency measure plus weights for each input image, and then obtained a fused image as the weighted average of input images.
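In the same spirit, the weighted-average scheme of [26] reduces to the pattern below: the fused image is a per-pixel convex combination of the inputs, with weights derived from saliency maps. The saliency maps are left to the caller here (e.g., a local variance); the HOSVD-based measure actually used in [26] is not reproduced.

    import numpy as np

    def weighted_average_fuse(img_a, img_b, sal_a, sal_b, eps=1e-12):
        """Per-pixel weighted average of two aligned images, driven by saliency maps."""
        w_a = sal_a / (sal_a + sal_b + eps)   # normalised weight of image A in [0, 1]
        return w_a * img_a + (1.0 - w_a) * img_b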

In contrast to [25], [26], the main contributions of our proposed fusion method are as follows:

  • (1) We employ the HOSVD (in the ST domain) to fuse (aligned) videos rather than static images. We thereby capture the temporal redundancy (correlation) among the different frames in a video, so that the static background scene and the temporally moving targets can easily be separated from each other.

  • (2) The identification of the different kinds of information, including noise, is achieved globally. Because of this, the method is considerably faster than our earlier structure-tensor method, which identifies information locally.

  • (3) By using the HOSVD, any noise can be easily distinguished and suppressed in the spatio-temporal video data using a simple shrinkage function during fusion (a sketch is given after this list).
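The sketch referred to in item (3) is given below: a standard soft-thresholding (shrinkage) of subband coefficients, with a robust MAD-based noise estimate. The threshold k·σ is a generic choice used for illustration and is not the specific rule of Eq. (17).

    import numpy as np

    def estimate_sigma(coeffs):
        """Robust noise estimate from the median absolute deviation (MAD)."""
        return np.median(np.abs(coeffs - np.median(coeffs))) / 0.6745

    def soft_shrink(coeffs, sigma, k=3.0):
        """Soft-threshold coefficients: entries below k*sigma (mostly noise) are driven to zero."""
        t = k * sigma
        return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)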

The rest of the paper is organized as follows. Section 2 presents the basic theory of the HOSVD and its application to video analysis. Section 3 describes the proposed video fusion algorithm in detail. Experimental results and some important conclusions are given in Sections 4 and 5, respectively.

Section snippets

HOSVD and its application to video analysis

Tensors (multi-way arrays) are generalizations of scalars, vectors and matrices to an arbitrary number of indices [27], [28]. In multi-linear algebra, tensors are represented as multi-dimensional arrays or N-mode matrices. The order of a tensor is the number of indices in the array. Thus an Nth-order tensor has N indices and is denoted as $\chi \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. In the following, we will discuss how to apply the tensor and the HOSVD to analyze a video.
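As a concrete reference for these definitions, the minimal NumPy sketch below computes the HOSVD of an arbitrary tensor: the mode-n factor matrix is formed from the left singular vectors of the mode-n unfolding, and the core tensor is obtained by multiplying the tensor by the transposed factors along every mode.

    import numpy as np

    def unfold(tensor, mode):
        """Mode-n unfolding: move axis `mode` to the front and flatten the remaining axes."""
        return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

    def hosvd(tensor):
        """HOSVD: one orthogonal factor U_n per mode plus the core S = tensor x_1 U_1^T ... x_N U_N^T."""
        factors = [np.linalg.svd(unfold(tensor, n), full_matrices=False)[0]
                   for n in range(tensor.ndim)]
        core = tensor
        for n, U in enumerate(factors):
            # Mode-n product with U^T: contract U^T with the tensor along mode n.
            core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, n, 0), axes=1), 0, n)
        return core, factors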

Video fusion based on HOSVD and the Surfacelet Transform

Having described in Section 2 the capability of the HOSVD to separate spatial from temporal features, here we focus on using it for video fusion, as outlined earlier in Fig. 2. In this section, we show that corresponding types of information from two input videos can be merged, using appropriate fusion strategies, to produce a single enhanced video. In addition, unlike in [12], the identification of the different types of information in the videos is achieved globally rather than pixel by pixel.
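A simplified illustration of this separation, using a plain matrix SVD along the temporal mode rather than the full HOSVD of the subband tensor, is given below: the dominant temporal component approximates the static background, and the residual carries the moving objects. The rank-1 choice is an assumption made only for the sketch.

    import numpy as np

    def split_static_moving(video, rank=1):
        """Split a (frames, height, width) video into a low-rank background and a motion residual."""
        T, H, W = video.shape
        X = video.reshape(T, H * W)                         # one frame per row
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        background = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # dominant temporal component
        moving = X - background                             # temporal detail (moving objects)
        return background.reshape(T, H, W), moving.reshape(T, H, W)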

Experiments and analysis

Three sets of experiments were performed, applying ST-HOSVD1 to noise-free videos and ST-HOSVD2 to noisy videos:

  • (1) Three pairs of noise-free input videos were fused using different existing fusion methods from the literature, and the results were compared with those of ST-HOSVD1.

  • (2) The impact of the parameter k2 in Eq. (17) on ST-HOSVD2 was examined.

  • (3) Several pairs of noisy input videos were fused using ST-HOSVD2.

Six pairs of infrared and visible videos were employed in the experiments. For convenience, they are

Conclusions

A novel video fusion algorithm (ST-HOSVD1) combining the 3D-ST and the HOSVD is proposed in this paper. By performing the HOSVD on the video tensor constructed from the 3D-ST subband coefficients, the spatial and temporal details contained in the input videos are isolated from each other and then merged by different fusion rules. We show that the proposed ST-HOSVD1 method outperforms traditional approaches based on the spatio-temporal energy “maximum selection” and “matching” (i.e., ST-Maximum and ST-Matching).

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grant No. 61104212, by the Fundamental Research Funds for the Central Universities under Grant Nos. K5051304001 and NSIY211416, and by the China Scholarship Council under Grant No. 201306965005. Martin D. Levine acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grant Number 228-266, RGPIN 1733-13.

Video Pair 1, Video Pair 2 and Video Pair 3 are kindly provided by

References (39)

  • W.M. Hu et al., A survey on visual surveillance of object motion and behaviors, IEEE Trans. Syst. Man Cybern. – Part C: Appl. Rev. (2004).

  • L. Snidaro et al., Fusing multiple video sensors for surveillance, ACM Trans. Multimedia Comput. Commun. Appl. (2012).

  • J. Shi et al., Change detection in synthetic aperture radar images based on fuzzy active contour models and genetic algorithms, Math. Prob. Eng. (2014).

  • L. Snidaro et al., Quality-based fusion of multiple video sensor for video surveillance, IEEE Trans. Syst. Man Cybern. – Part B: Cybern. (2007).

  • O. Rockinger, Image sequence fusion using a shift-invariant wavelet transform, in: Proceedings of the International...

  • E.P. Bennett et al., Multispectral bilateral video fusion, IEEE Trans. Image Process. (2007).

  • Q. Zhang et al., Multisensor video fusion based on spatial–temporal salience detection, Signal Process. (2013).

  • L. Xu et al., Image sequence fusion and denoising based on 3D shearlet transform, J. Appl. Math. (2014).

  • Y.M. Lu et al., Multidimensional directional filter banks and surfacelets, IEEE Trans. Image Process. (2007).