Neurocomputing, Volume 111, 2 July 2013, Pages 70-80

Fast saliency-aware multi-modality image fusion

https://doi.org/10.1016/j.neucom.2012.12.015

Abstract

This paper proposes a saliency-aware fusion algorithm for integrating infrared (IR) and visible light (ViS) images (or videos), with the aim of enhancing the visualization of the latter. Our algorithm involves saliency detection followed by a biased fusion. The goal of the saliency detection is to generate a saliency map for the IR image, highlighting the co-occurrence of high brightness values ("hot spots") and motion. Markov Random Fields (MRFs) are used to combine these two sources of information. The subsequent fusion step biases the end result in favor of the ViS image, except when a region shows clear IR saliency, in which case the IR image gains (local) dominance. By doing so, the fused image succeeds in depicting the salient foreground objects (gleaned from the IR image) against an easily recognizable background supplied by the ViS image. An evaluation of the proposed saliency detection method indicates improvements in detection accuracy when compared to state-of-the-art alternatives. Moreover, both objective and subjective assessments reveal the effectiveness of the proposed fusion algorithm in terms of visual context enhancement.1

Introduction

Recent advances in imaging, networking, data processing and storage technology have led to a tremendous expansion of the use of multi-modality images/videos in a variety of fields. A typical application is surveillance imaging, where the advantages of different imaging sensors are combined to enhance the capability of vision systems. At the core of such an application is multi-modality image fusion, which combines multiple images captured by different modalities into a single representation. This single fused image provides comprehensive information about the scene, so that the operator does not need to check each image separately. This is nicely illustrated in the case of fire monitoring based on the combination of IR and ViS images, where the system is expected to locate the fire at an early stage. While a ViS image may allow the operator to readily spot a billowing smoke plume, the actual location of the fire is more easily deduced by inspecting the corresponding hot spot in the IR image. If the two images are combined properly, both the hot spot and the smoke become visible in the fused image, enabling the operator to quickly and precisely locate the fire.

Image fusion has been studied extensively [1], [2]. Depending on the intended application, different fusion methods have been developed, but two basic research lines have gained prominence: pixel-based and region-based fusion. Pixel-based image fusion combines the images at the pixel level, while region-based image fusion treats pixels constituting the same object as a single entity. From the perceptual point of view, region-based fusion is often superior, since meaningful objects attract more attention than incoherent individual pixels. However, applying region-based fusion is not straightforward, for several reasons. Firstly, an assumption underlying region-based fusion is that segmented images from multiple modalities are similar in terms of region location and size. Unfortunately, this does not hold when the two modalities differ significantly from each other, as is the case for long-wavelength IR and ViS images. Secondly, image segmentation is computationally expensive, so fusion based on it is unsuited to applications, such as surveillance imaging, where real-time processing is of paramount importance. Thirdly, region-based fusion treats each segmented region on an equal footing, irrespective of the region's saliency, whereas for many applications only a select few regions bear significance.

In this paper, an IR and ViS image/video fusion algorithm is proposed to enhance the visualization of a surveillance imaging system. The core idea is to take the saliency of the region into account during the fusion procedure. Our work differs from the existing work in three aspects.

  • We select the ViS image as a sort of reference or iconic image [19], and the fusion is biased in favor of the ViS image. The reason is that the ViS image often provides a familiar impression of the scene, thus reducing the cognitive load for the supervisor who has to recognize or locate the target object.

  • Instead of partitioning the image into seamless regions, we only extract regions that are salient in terms of the intended application. In this paper, we perform saliency detection on the IR image, as IR thermography can "see" objects without illumination.

  • We consider the saliency detection as a classification problem in which a Markov Random Field (MRF) is called upon to harness the co-occurrence of hot spots and motion. This model helps to generate a better saliency map, as consistency of neighboring pixels is adequately taken into consideration [8].

In this context, we designate a region as salient if it consistently tends to attract viewers' attention. Obviously, it is difficult to design a generic saliency extraction algorithm that can be applied to a multitude of applications. However, we believe that it is possible to define saliency in operational terms once the context of an application is provided. Since in our application (wildfire surveillance and monitoring) the regions of interest, either fire or humans, tend to be hotter and moving when compared to the background, we define saliency in terms of high IR brightness and motion.

In the sequel, we first review the literature in Section 2. In Section 3, we present our framework. We then introduce the idea of MRF-based saliency detection on the IR image in Section 4. In Section 5, we describe our biased multi-resolution (wavelet-based) fusion algorithm. Experimental results are provided in Section 6. Finally, conclusions are drawn in Section 7.

Section snippets

Prior work on image fusion

Several survey papers on image fusion [1], [2], [3] have appeared over the years, together providing a broad overview of more than one hundred papers. In keeping with most of the literature, we divide existing techniques into two categories: pixel-based fusion and region-based fusion.

Pixel-based image fusion algorithms [4], [5] combine images at the pixel level. The fusion schemes range from simple spatial pixel-value fusion to more complex transform-based fusion. The simplest form of the spatial

The system overview

Fig. 1 depicts the proposed system architecture with its main functional units and data flows. The functions of the key modules are as follows.

  • Hot spot detection in IR. The top 5% of pixels in terms of IR brightness are regarded as hot-spot pixels. From these, a probabilistic map is generated in which the probability of a pixel being a hot spot is proportional to its intensity value (see the sketch after this list).

  • Motion detection in IR. We exploit background subtraction to extract the motion pixels, assuming the camera is fixed. Here,
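As a minimal sketch of these two detection modules (assuming grayscale IR frames as NumPy arrays and a precomputed background frame; the function names, the linear normalization, and the exponential motion score are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def hotspot_probability(ir, top_fraction=0.05):
    """Flag the brightest pixels as hot-spot candidates and assign them
    probabilities that grow with brightness."""
    thresh = np.quantile(ir, 1.0 - top_fraction)      # top-5% brightness cutoff
    prob = np.zeros(ir.shape, dtype=float)
    hot = ir >= thresh
    if hot.any():
        lo, hi = ir[hot].min(), ir[hot].max()
        # scale intensities above the cutoff linearly into (0, 1]
        prob[hot] = (ir[hot] - lo + 1.0) / (hi - lo + 1.0)
    return prob

def motion_probability(frame, background, scale=20.0):
    """Background subtraction for a fixed camera: larger deviations from
    the background model give higher motion probabilities."""
    diff = np.abs(frame.astype(float) - background.astype(float))
    return 1.0 - np.exp(-diff / scale)                # soft, saturating score

# Toy usage: one bright, moving pixel in an otherwise static 4x4 IR frame.
ir = np.array([[10., 10., 10., 10.],
               [10., 10., 250., 10.],
               [10., 10., 10., 10.],
               [10., 10., 10., 10.]])
background = np.full((4, 4), 10.0)
cues = hotspot_probability(ir) * motion_probability(ir, background)
```

Multiplying the two maps, as in the last line, is one crude way to express the co-occurrence of heat and motion; the MRF stage described in Section 4 consolidates these cues into a spatially consistent saliency map.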

Markov random fields (MRFs)-based saliency detection

Saliency detection can be interpreted as a pixel labeling problem, where each pixel in the image is labeled as either salient or non-salient. Pixel labeling is a typical classification problem, which can be treated by a Markovian Maximum A Posteriori (MAP) approximation. Given the observed features f and the configuration of labels l, the posterior probability of l is

P(l|f) = P(f|l) P(l) / P(f).

Maximizing P(l|f) amounts to maximizing the product of the class-conditional probability P(f|l) and the prior P(l), since P(f) does not depend on l.
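To make the MAP formulation concrete, here is a minimal sketch that approximates the maximization with iterated conditional modes (ICM) over a two-label Potts model; the smoothness weight beta, the ICM optimizer, and the use of per-pixel saliency probabilities as likelihoods are illustrative assumptions, since the paper's exact potentials are not shown in this snippet:

```python
import numpy as np

def icm_map_labeling(p_salient, beta=1.5, iters=5):
    """Approximate the MAP label field l (1 = salient, 0 = non-salient)
    given per-pixel probabilities P(l=1|f), using iterated conditional
    modes (ICM) with a Potts prior over 4-connected neighbors."""
    eps = 1e-9
    # Unary costs: negative log-likelihood of each label at each pixel.
    cost = np.stack([-np.log(1.0 - p_salient + eps),   # label 0
                     -np.log(p_salient + eps)])        # label 1
    labels = (p_salient > 0.5).astype(int)             # initial guess
    h, w = p_salient.shape
    for _ in range(iters):
        for y in range(h):
            for x in range(w):
                energies = []
                for lab in (0, 1):
                    e = cost[lab, y, x]
                    # Potts penalty: pay beta for every disagreeing neighbor.
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] != lab:
                            e += beta
                    energies.append(e)
                labels[y, x] = int(np.argmin(energies))
    return labels
```

ICM only reaches a local optimum; for a two-label Potts model, graph cuts would deliver the exact MAP solution, at the cost of a heavier dependency.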

Biased MR image fusion

We base our MR image fusion on the wavelet transform, due to the resemblance between its filtering properties and the human visual system. Basically, our fusion consists of four steps: (1) generate the saliency map from the IR image; (2) carry out a wavelet transform of both the IR and ViS images; (3) fuse the coefficients at each scale using different fusion rules; (4) perform the inverse wavelet transform to construct the fused image. Here, we will focus on explaining the coefficient fusion
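A minimal single-level sketch of such a saliency-biased coefficient fusion, using PyWavelets; the Haar basis, the bias weight, the nearest-neighbor downsampling of the saliency map to coefficient resolution, and the hard detail-selection rule are illustrative assumptions rather than the paper's exact fusion rules:

```python
import numpy as np
import pywt

def biased_wavelet_fusion(ir, vis, saliency, vis_bias=0.7):
    """Fuse IR and ViS images so the result follows the ViS image,
    except where the IR saliency map marks a salient region."""
    ir_a, (ir_h, ir_v, ir_d) = pywt.dwt2(ir.astype(float), 'haar')
    vi_a, (vi_h, vi_v, vi_d) = pywt.dwt2(vis.astype(float), 'haar')
    # Bring the saliency map to coefficient resolution (Haar halves each axis).
    s = saliency[::2, ::2][:ir_a.shape[0], :ir_a.shape[1]]
    # Approximation band: ViS-biased average, overridden by IR where salient.
    w_ir = (1.0 - vis_bias) + vis_bias * s   # s=0: mostly ViS, s=1: pure IR
    fused_a = w_ir * ir_a + (1.0 - w_ir) * vi_a
    # Detail bands: take IR details inside salient regions, ViS elsewhere.
    pick_ir = s > 0.5
    fused_h = np.where(pick_ir, ir_h, vi_h)
    fused_v = np.where(pick_ir, ir_v, vi_v)
    fused_d = np.where(pick_ir, ir_d, vi_d)
    return pywt.idwt2((fused_a, (fused_h, fused_v, fused_d)), 'haar')
```

Extending this to multiple decomposition levels (e.g. via pywt.wavedec2) would only require resampling the saliency map once per scale before applying the same rules.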

Experimental results

Our proposed system is implemented in C++ on a laptop PC (dual-core 2.53 GHz, 4 GB RAM) with a 64-bit operating system. We have tested our algorithm in two surveillance-related scenarios. In the first scenario, the task is to monitor an area within which a fire may occur.2 In this situation, one may separately observe the hot spot (flame) in the IR image and the smoke in the ViS image. If the two images are fused properly, one is able to

Conclusion

In this paper, we propose a fast saliency-aware image fusion algorithm, inspired by the region-based image fusion concept. The major difference between traditional algorithms and ours lies in the fact that our algorithm takes the saliency of the object/region into account: the saliency of the region drives the fusion procedure. In order to generate a consistent saliency map from the IR image, we feed the co-occurrence of hot spots and motion into an MRF model. Both objective and subjective assessments reveal the effectiveness of the proposed fusion algorithm in terms of visual context enhancement.

Acknowledgments

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7-ENV-2009-1) under Grant agreement no FP7-ENV-244088 “FIRESENSE—Fire Detection and Management through a Multi-Sensor Network for the Protection of Cultural Heritage Areas from the Risk of Fire and Extreme Weather”.


References (33)

  • Q. Miao, B. Wang, A novel image fusion method using contourlet transform, in: Proceedings of the IEEE Conference on...
  • P. de Zeeuw, The multigrid image transform, in: Proceedings of the Image Processing Based on Partial Differential...
  • P. Burt et al., The Laplacian pyramid as a compact image code, IEEE Trans. Commun. (1983)
  • E. Simoncelli, W. Freeman, The steerable pyramid: a flexible architecture for multi-scale derivative computation, in:...
  • P. Burt, Enhanced image capture through fusion, in: Proceedings of the Conference on Computer Vision, 1993, pp....
  • Z. Zhang, R. Blum, A region-based image fusion scheme for concealed weapon detection, in: Proceedings of the Conference on...

    Jungong Han received his Ph.D. degree in Telecommunication and Information System from XiDian University, China, in 2004. During his Ph.D. study, he spent one year at Internet Media group of Microsoft Research Asia, China. From 2005 to 2010, he was with Signal Processing Systems group at the Technical University of Eindhoven, The Netherlands. In December of 2010, he joined the Centre for Mathematics and Computer Science (CWI) in Amsterdam, as a research staff member. In July of 2012, he started a senior scientist position with Civolution technology in Eindhoven (a combining synergy of Philips Video Content Identification and Thomson STS).

    His research interests include multimedia security, multi-sensor data fusion, video content analysis, and computer vision. He has written and co-authored over 60 papers including 3 invited papers in these areas. One of his algorithm implementations has been commercialized and used by a start-up company. He is an associate editor of Elsevier Neurocomputing and Journal of Convergence Section C: Web and Multimedia. He has been (lead) guest editor for several journals, such as IEEE-T-SMC:B and Pattern Recognition Letters. He is a member of IEEE IDSP Standing Committee, and a voting member of IEEE Multimedia Communications Technical Committee.

    Eric J. Pauwels joined the computer vision research group at ESAT (Leuven University, Belgium) after completing his Ph.D. in Mathematics, and worked on various mathematical problems in computer vision, including differential, semi-differential and algebraic invariants and their application to object recognition. In 1999, he joined the Signals and Images research group at the Centre for Mathematics and Computer Science (CWI) in Amsterdam where he focuses on two topics: content based image retrieval, and multimodal camera and sensor networks for situational awareness in smart environments. He has contributed to numerous national and European projects and was the scientific coordinator of the FP6 Network of Excellence on Multimedia Understanding through Semantics, Computations and Learning (MUSCLE). He founded and acted as the first chairman for the ERCIM Working Group on Image and Video Understanding. He also organized and chaired the first international workshop on Distributed Sensing and Collective Intelligence in Biodiversity Monitoring.

    Paul de Zeeuw is a numerical mathematician, affiliated with the CWI, Amsterdam (NL), since 1979. He studied mathematics and computer science at the University of Leiden and obtained his Ph.D. from the University of Amsterdam. He has authored and co-authored many papers on multigrid algorithms for the solution of partial differential equations. One paper in particular is much cited, and the accompanying computer code is widely used. He has also participated in image processing projects; as a spin-off, two Matlab toolboxes have been built and made available on the web. Further, he has been an author at the Dutch Open University on the topic of numerical linear algebra, and was the secretary of the Dutch-Flemish Numerical Analysis Society from 1997 till 2002, including being editor of its newsletter. He has acted as a reviewer of project proposals. Present focal points are applications of multi-resolution methods in image processing, including image fusion and content-based image retrieval.

    1. This work was done while Jungong Han was at CWI.
