1 Introduction

Thermography, also known as infrared (IR) imaging or thermal imaging, is a fast-growing field in both research and industry with a wide range of applications. At power stations it is used to monitor high-voltage systems. Construction workers use it to check for defective insulation in houses, and firefighters use it as a tool when searching for missing people in burning buildings. It is also used in various other contexts for surveillance.

In the field of image analysis, especially within computer vision, the majority of the research has focused on regular visual images. Many tasks there involve the use of an interest point or feature detector in combination with a feature descriptor. These are, for example, used in subsequent processing to achieve panorama stitching, content-based indexing, tracking, reconstruction, recognition, etc. Local features thus play essential roles there, and their development and evaluation have, for many years, been an active research area, resulting in rich knowledge on useful detectors and descriptors.

A research question we pose in this paper is how we can exploit those local features in other types of images, in particular images in the IR spectral band, which have been less investigated. The fact that IR images and visual images have different characteristics, with IR images typically containing less high-frequency information, necessitates an independent study of the performance of common local detectors in combination with descriptors on IR images. In this context, the contributions of this paper are:

  1. the systematic evaluation of local detectors and descriptors in their combinations under six different image transformations using established metrics, and

  2. a new IR image database (http://www.csc.kth.se/~atsuto/dataset.html).

1.1 Related Work

For visual images, several detectors and descriptors have been proposed and evaluated in the past. Mikolajczyk and Schmid [1] carried out a performance evaluation of local descriptors in 2005, testing the descriptors on both circular and affine-shaped regions, with GLOH [1] and SIFT [2] showing the highest performance. They created a database consisting of images of different scene types under different geometric and photometric transformations, which later became a benchmark for visual images.

A thorough evaluation of affine region detectors was also performed in [3]. The focus was to evaluate the performance of affine region detectors under different image condition changes: scale, view-point, blur, rotation, illumination and JPEG compression. The best performance in many cases was obtained by MSER [4], followed by Hessian-Affine [5, 6] and Harris-Affine [5, 6].

Focusing on fast feature matching, another evaluation [7] was performed more recently for both detectors and descriptors: the comparison of descriptors shows that the newer real-valued descriptors LIOP [8], MRRID [9] and MROGH [9] outperform the state-of-the-art descriptors SIFT and SURF [10] at the expense of decreased efficiency. Our work is partly inspired by yet another recent evaluation [11] involving exhaustive comparisons of binary features, although those are all for visual images.

IR images have been studied in problem domains such as face recognition [12, 13], object detection/tracking [14, 15], and image enhancement of visual images using near-infrared images [16], to name a few. With respect to local features, a feature point descriptor for both far-infrared and visual images was presented in [17], while a scale invariant interest point detector of blobs was tested against common detectors on IR images in [18]. That work, however, did not use the standard benchmark evaluation framework, which prevented direct comparisons with other results.

The most relevant work to our objective is that of Ricaurte et al. [19], which evaluated classic feature point descriptors in both IR and visible light images under image transformations: rotation, blur, noise and scale. It was reported that SIFT performed best among the considered descriptors in most of their tests, although no clear overall winner emerged. Nevertheless, unlike the studies on visual images, the evaluation was limited in that it did not test different combinations of detectors and descriptors and omitted view-point changes. Nor was it based on the standard evaluation framework [1, 3].

To the best of the authors' knowledge, a thorough performance evaluation of combinations of detectors and descriptors had yet to be made on IR images.

Fig. 1. Examples of images, under various deformations, that are included in the data set. Each image pair consists of a reference image, left, and a test image, right. (a,b) Viewpoint, (c,d) rotation, (e,f) scale, (g,h) blur, (i,j) noise, (k,l) downsampling.

2 Evaluation Framework

The benchmarks introduced in [1, 3] constitute well-established evaluation frameworks for measuring the performance of detectors and descriptors. We therefore use them to ensure the reliability and comparability of the results in this work.

2.1 Matching

To obtain matching features we use nearest neighbours (NN). To qualify as a NN match, the two candidates have to be each other's closest descriptors in descriptor space, i.e. the match must be mutual. The distance between features is calculated with the Euclidean distance for floating point descriptors, whereas the Hamming distance is applied to binary descriptors.

Further, a descriptor is only allowed to be matched once, which is also known as a putative match [11]. Out of the acquired matches, correct matches are identified by comparing the result to the ground truth correspondences. The correspondences are the correct matching interest points between a test image and a reference image. For details on how the ground truth is created, see [1].
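As an illustration, the mutual NN matching could be sketched as follows in Python with NumPy, assuming descriptors are stored row-wise (binary descriptors as packed uint8 rows); this is a minimal sketch, not the code used in the evaluation:

```python
import numpy as np

def mutual_nn_matches(desc_a, desc_b, binary=False):
    """Return putative matches (i, j) where desc_a[i] and desc_b[j] are
    each other's nearest neighbours, so each descriptor is matched once."""
    if binary:
        # Hamming distance for binary descriptors (packed uint8 rows).
        xor = desc_a[:, None, :] ^ desc_b[None, :, :]
        dist = np.unpackbits(xor, axis=2).sum(axis=2)
    else:
        # Euclidean distance for floating point descriptors.
        diff = desc_a[:, None, :] - desc_b[None, :, :]
        dist = np.linalg.norm(diff, axis=2)
    nn_ab = dist.argmin(axis=1)  # nearest neighbour in B for each row of A
    nn_ba = dist.argmin(axis=0)  # nearest neighbour in A for each row of B
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]
```

The correct matches among these putative matches are then identified against the ground truth correspondences as described above.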

2.2 Region Normalization

When evaluating descriptors, a measurement region larger than the detected region is generally used. The motivation is that blob detectors such as Hessian-Affine and MSER extract regions whose large signal variations lie at the borders. To increase the distinctiveness of the extracted regions, the measurement region is therefore enlarged by a scale factor so as to include those larger signal variations. This scale factor is applied to the regions extracted by all detectors. A drawback of the scaling is the risk of reaching outside the image border.

In this work we implement an extension of the region normalization used in [3] (source available) that expands the image by assigning values to the unknown area through bilinear interpolation based on the border values.

As detected regions are of circular or elliptical shape, all regions are normalized to a circular shape of constant radius to achieve scale and affine invariance.
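A sketch of such a normalization for a single affine region, assuming the region is parameterized by its centre and a 2×2 matrix A mapping the unit circle to the detected ellipse; the function name and the measurement-region scale factor of 3 are illustrative assumptions, and border replication stands in for the bilinear border extension described above:

```python
import cv2
import numpy as np

def normalize_region(image, center, A, scale=3.0, diameter=49):
    """Warp an elliptical measurement region to a circular patch of
    constant diameter, making the patch scale and affine invariant."""
    r = diameter / 2.0
    M = (scale / r) * A  # patch pixel offsets -> image pixel offsets
    # Affine map taking patch coordinates to image coordinates, with the
    # patch centre (r, r) mapped onto the region centre.
    T = np.array([[M[0, 0], M[0, 1], center[0] - M[0, 0] * r - M[0, 1] * r],
                  [M[1, 0], M[1, 1], center[1] - M[1, 0] * r - M[1, 1] * r]])
    return cv2.warpAffine(image, T, (diameter, diameter),
                          flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP,
                          borderMode=cv2.BORDER_REPLICATE)
```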

2.3 Performance Measures

Recall. Recall is the ratio of the number of correct matches to the number of correspondences (defined in [1]). The measure therefore describes how many of the ground truth matches were actually found:

\(\text{recall} = \frac{\#\text{correct matches}}{\#\text{correspondences}}\)  (1)

1–Precision. The 1–Precision measure portrays the ratio between the number of false matches and the total number of matches (defined in [1]):

\(1 - \text{precision} = \frac{\#\text{false matches}}{\#\text{all matches}}\)  (2)

Matching Score. MS is defined as the ratio of the number of correct matches to the number of detected features visible in both images:

\(\text{MS} = \frac{\#\text{correct matches}}{\#\text{features visible in both images}}\)  (3)
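Given the match counts, the three measures reduce to simple ratios; a minimal sketch (the function and argument names are ours):

```python
def performance_measures(n_correct, n_false, n_correspondences, n_visible):
    """Recall (Eq. 1), 1-precision (Eq. 2) and matching score (Eq. 3)
    from the match counts; n_visible is the number of detected features
    visible in both images."""
    n_matches = n_correct + n_false
    recall = n_correct / n_correspondences
    one_minus_precision = n_false / n_matches
    matching_score = n_correct / n_visible
    return recall, one_minus_precision, matching_score
```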

2.4 Database

We have generated a new IR image data set for this study. The images in the database can be divided into the categories of structured and textured scenes. A textured scene has repetitive patterns of different shapes, while a structured scene has homogeneous regions with distinctive edges. In Fig. 1, examples of structured scenes are presented in the odd columns of image pairs and those of textured scenes in the even columns. The database is created by synthetically modifying the standard images captured by the camera to include the desired image condition changes. An exception is the view-point changes, where all images (of mostly planar scenes) were captured with a FLIR T640 camera without modification. The database consists of 118 images in total.

Deformation Specification. The image condition changes we include in the evaluation are six-fold: view-point, scale, rotation, blur, noise and downsampling (a sketch of the synthetic deformations follows the list).

  • Images are taken from different view-points, starting at a \(90^{\circ }\) angle to the object. The maximum view-point angle is about 50–60\({^{\circ }}\) relative to this.

  • Zoom is imitated by scaling the height and width of the image using bilinear interpolation. The zoom factor is in the range \(\times \)1.25–\(\times \)2.5.

  • Rotated images are created from the standard images in \({10}{^{\circ }}\) increments.

  • Images are blurred using a Gaussian kernel of size \(51\times 51\) pixels and standard deviation up to 10 pixels.

  • White Gaussian noise is added with increasing variance from 0.0001 to 0.005, with the image normalized to the range between 0 and 1.

  • Images are downsampled to three reduced sizes: by a factor of 2, 4 and 8.
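For concreteness, the synthetic deformations could be produced along the following lines with OpenCV; this is a sketch with one example parameter from each stated range, not the exact generation code:

```python
import cv2
import numpy as np

def example_deformations(img):
    """One sample deformation per category from the ranges stated above."""
    h, w = img.shape[:2]
    # Scale: imitate zoom by a factor in [1.25, 2.5], bilinear interpolation.
    zoomed = cv2.resize(img, None, fx=1.5, fy=1.5,
                        interpolation=cv2.INTER_LINEAR)
    # Rotation: one 10-degree increment about the image centre.
    R = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), 10.0, 1.0)
    rotated = cv2.warpAffine(img, R, (w, h))
    # Blur: 51x51 Gaussian kernel, standard deviation up to 10 pixels.
    blurred = cv2.GaussianBlur(img, (51, 51), 5.0)
    # Noise: white Gaussian noise with variance in [0.0001, 0.005] on the
    # image normalized to [0, 1].
    f = img.astype(np.float64) / 255.0
    f += np.random.normal(0.0, np.sqrt(0.001), f.shape)
    noisy = (np.clip(f, 0.0, 1.0) * 255.0).astype(np.uint8)
    # Downsampling: reduce by a factor of 2 (4 and 8 in the data set as well).
    downsampled = cv2.resize(img, (w // 2, h // 2),
                             interpolation=cv2.INTER_AREA)
    return zoomed, rotated, blurred, noisy, downsampled
```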

2.5 Implementation Details

Local features are extracted using the OpenCV [20] version 2.4.10 and VLFeat [21] version 0.9.20 libraries. OpenCV implementations are used for SIFT, SURF, MSER, FAST, ORB, BRISK, BRIEF [22] and FREAK [23], while Harris-Affine, Hessian-Affine and LIOP are VLFeat implementations. Unless explicitly stated otherwise, the parameters are those suggested by the respective authors.
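Mixing the detector of one method with the descriptor of another is straightforward in OpenCV; below is a sketch of the orb-brisk combination using the current *_create factory API (the evaluation itself used the older 2.4.10 interface, and the file name is hypothetical):

```python
import cv2

img = cv2.imread('ir_image.png', cv2.IMREAD_GRAYSCALE)  # hypothetical file

detector = cv2.ORB_create()      # ORB interest point detector
descriptor = cv2.BRISK_create()  # BRISK descriptor extractor

keypoints = detector.detect(img, None)
keypoints, descriptors = descriptor.compute(img, keypoints)
```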

IR images are loaded into Matlab R2014b using FLIR's Atlas SDK. The loaded IR images contain 16-bit data, which is quantized to 8-bit data and preprocessed by histogram equalization.
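An equivalent preprocessing step in Python could look as follows; the min-max mapping used for the 16-to-8-bit quantization is our assumption of a reasonable choice:

```python
import cv2
import numpy as np

def preprocess_ir(raw16):
    """Quantize 16-bit IR data to 8 bits and equalize the histogram."""
    lo, hi = float(raw16.min()), float(raw16.max())
    img8 = np.uint8(np.round((raw16 - lo) / (hi - lo) * 255.0))
    return cv2.equalizeHist(img8)
```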

To calculate the recall, MS and 1–Precision, this work utilizes code from [3].

Parameter Selection. VLFeat includes implementations of Harris-Laplace and Hessian-Laplace with the possibility of affine shape estimation. The detector functions expose parameters controlling a peak threshold and an edge threshold.

The peak threshold sets the minimum acceptable cornerness measure for a feature to be considered a corner in Harris-Affine and, equivalently, a blob by the determinant of the Hessian matrix in Hessian-Affine. According to the authors of [5], the value used for the cornerness threshold was 1000. As no corresponding value is reported for the Hessian-Affine threshold, we set it to 150. With the selected thresholds, the number of extracted features is of the same order of magnitude as for the other detectors in the evaluation. The edge threshold is an edge rejection threshold that eliminates points with too small a curvature; it is set to the predetermined value of 10.

Regarding the region normalization, we choose a diameter of 49 pixels, whereas 41 pixels was chosen arbitrarily in [1]. The choice of a larger diameter is based on the standard settings in the OpenCV library for the BRIEF descriptor.

3 Evaluation Results

This section presents the results of the combinations of detectors and descriptors listed in Table 1. The evaluation is divided into floating point and binary combinations, with the exceptions of Harris-Affine combined with ORB and BRISK, and SURF combined with BRIEF and FREAK. The combination of Harris-Affine and ORB showed good performance in [11], while BRISK is combined with Harris-Affine because the descriptor showed good performance throughout this work. SURF is combined with binary descriptors as it is known to outperform other floating point detectors in computational speed. These combinations were also tested in the evaluation on visual images in [7].

Evaluated combinations are named by concatenating the detector and the descriptor, with hes and har short for Hessian-Affine and Harris-Affine. In the case of no concatenation, e.g. orb, both the ORB detector and descriptor are applied.

The performances are presented as precision-recall curves for the structured scenes in Fig. 2 and for the textured scenes in Fig. 3. We also present the average results over both scene types in recall, precision and MS for each transformation. Here the threshold is set to accept all obtained matches, as a suitable threshold would depend on descriptor size and descriptor type.

3.1 Precision-Recall Curve

Recall and 1–Precision are commonly combined to visualize the performance of descriptors. The curve is created by varying an acceptance threshold for the distance between two NN matched features in descriptor space. If the threshold is small, one is strict in acquiring correct matches, which leads to high precision but low recall. A high threshold means that all possible matches are accepted, which leads to low precision, due to many false positives, and a high recall, since all correct matches are accepted. Ideally a recall equal to one is obtained for any precision. In real-world applications this is not the case, as noise and other factors may decrease the similarity between descriptors. Another factor is that regions can be considered correspondences with an overlap error of up to 50 %, so the descriptors will describe information in areas not covered by the other region. A descriptor with a slowly increasing curve indicates that it is affected by the image deformation.
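The curve can be traced directly from the putative NN matches by sweeping the acceptance threshold over the sorted match distances; a minimal sketch, assuming `dists` holds the NN match distances and `is_correct` a boolean array marking ground-truth-correct matches:

```python
import numpy as np

def precision_recall_curve(dists, is_correct, n_correspondences):
    """Recall and 1-precision at every threshold, i.e. after accepting
    the k closest putative matches for k = 1..N."""
    order = np.argsort(dists)                # accept matches closest-first
    correct = np.cumsum(is_correct[order])   # correct matches accepted so far
    accepted = np.arange(1, len(dists) + 1)  # all matches accepted so far
    recall = correct / n_correspondences                   # Eq. (1)
    one_minus_precision = (accepted - correct) / accepted  # Eq. (2)
    return one_minus_precision, recall
```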

3.2 Results

View-Point. The effect of view-point changes on different combinations is illustrated in Fig. 2a and b for the structured scene in Fig. 1a, while the results for the textured scene in Fig. 1b are presented in Fig. 3a and b. The average performances against perspective changes in the structured and textured scenes are presented in Table 2a.

Table 1. Included binary and floating point detectors and descriptors. Binary types are marked with (*)
Fig. 2. Performance against view-point (a) & (b), rotation (c) & (d), scale (e) & (f), blur (g) & (h), noise (i) & (j), downsampling (k) & (l) in structured scenes.

Fig. 3. Performance against view-point (a) & (b), rotation (c) & (d), scale (e) & (f), blur (g) & (h), noise (i) & (j), downsampling (k) & (l) in textured scenes.

From the results it is clear that the performance varies depending on the scene and the combination. In the structured scene, all combinations show a dependency on perspective changes, indicated by a slow, continuous increase in recall. The best performances among floating point descriptors are obtained by mser-liop and hes-liop.

Among binary combinations, the best performance is obtained by orb-brisk for both scenes, with results comparable to the best performers in the floating point family of combinations. Next in performance are orb and orb-freak, indicating that combinations based on the ORB detector outperform other binary combinations based on BRISK and FAST.

Rotation. The results of the combinations under rotation are illustrated in Fig. 2c and d for the structured scene in Fig. 1c, and in Fig. 3c and d for the textured scene in Fig. 1d. The average results are presented in Table 2b.

We observe that the overall performance is much higher for rotation than for view-point changes. The majority of combinations have high performance in both the structured and the textured scene. Figure 2d shows an illustrative example of how different detectors and descriptors perform in different setups. For example, surf-brief, with BRIEF known to be sensitive to rotation, performs poorly, while surf-freak and surf, though still showing a dependence on rotation, have greatly improved performance. The poor performance of surf-brief is shown by its fixed curve at low precision and recall in the lower right corner.

The overall best performance is achieved by hes-liop followed by har-liop among floating point combinations, while among binary methods the best performance is obtained by orb-brisk, orb and orb-freak.

Scale. The effects of scaling are shown in Fig. 2e and f for the structured scene in Fig. 1e, and in Fig. 3e and f for the textured scene in Fig. 1f. The average results are presented in Table 2c.

The combinations show a stable behavior with similar performance in both scene types. Among floating point combinations, the best performance is achieved by mser-liop, followed by hes-liop. The top performers among binary combinations are surf-brief, orb-freak and orb-brisk.

Blur. The results of the combinations applied to images smoothed by a Gaussian kernel can be seen in Fig. 2g and h for the scene in Fig. 1g, and in Fig. 3g and h for the scene in Fig. 1h. The combinations' average results are presented in Table 2d.

The best performance among floating point combinations is attained by surf, which outperforms other combinations in stability, visualized by a horizontal precision-recall curve. It is followed by sift and hes-liop. The overall best performance is found in the category of binary combinations, with surf-brief as the top performer, outperforming the floating point combinations. The next performers are orb-brisk and orb, which achieve the best performance among corner-based combinations, with results comparable to or better than the blob-based combinations.

Table 2. Average performance when using NN as matching strategy by the measures: precision, recall and Matching Score (MS).

Noise. The performance of the combinations applied to images with added white Gaussian noise is presented in Fig. 2i and j for the structured scene in Fig. 1i. The corresponding results for the textured scene in Fig. 1j are shown in Fig. 3i and j. The average results of the two scenes against noise are presented in Table 2e.

The overall performance for the various combinations is relatively high. The best performance among floating point combinations is attained by hes-liop and surf, which stagnate at about the same level of recall in the precision-recall curves for both scenes and in Table 2e. The overall best performance in the case of added noise is achieved by orb-brisk, followed by orb and surf-brief, showing better performance than the floating point category.

Downsampling. Last, we evaluate the effect of downsampling on the combinations and present the results in Fig. 2k and l for the structured scene in Fig. 1k. For the textured scene in Fig. 1l, the results are presented in Fig. 3k and l. The obtained average results for NN matching are presented in Table 2f.

Studying the precision-recall curves of the floating point methods and Table 2f, the best performers on downsampled images are surf, hes-liop, har-liop and har-brisk. Among binary methods the best performance is obtained by surf-brief, with better results than surf, followed by surf-freak, orb-brisk and orb. In Fig. 2k, MSER reaches 100 % recall, which can be explained by the very small number of detected regions.

4 Comparisons to Results in Earlier Work

4.1 IR Images

The most closely related work in the long wave infrared (LWIR) spectral band [19] shows both similarities and differences to the results in this work. The best performance against blur is obtained by SURF in both evaluations. For rotation and scale, the best performance is achieved by SIFT in the compared evaluation, while LIOP, not included in that evaluation, shows the highest robustness to these deformations in this work.

Among binary combinations, [19] reports a low performance for ORB and BRISK with their default detectors. In this work, the low performance of the default BRISK combination is likewise observed, while the default ORB combination is a top performer among binary methods. An important difference between these two evaluations is that we have compared numerous detector and descriptor combinations, which has led to the conclusion that the ORB detector and the BRISK descriptor are a good match.

4.2 Visual Images

In the evaluation of binary methods for visual images in [24], it is obvious how the combination of detector and descriptor can affect the performance. When evaluating descriptors with their default detectors, BRISK and FREAK perform much worse than when combined with the ORB detector. The best overall performance was obtained by the ORB detector in combination with the FREAK or BRISK descriptors, and ORB combined with FREAK is the suggested combination to use. In this work we have observed that BRISK with its default detector performed worse (in most categories the worst) than in combination with the ORB detector, while the combination of ORB and FREAK has lower performance in the LWIR spectral band. Given the high performance of the combination of ORB and BRISK, we can conclude that the choice of combination has a large effect on the performance in both visual and IR images.

Another similarity is the high performance of Hessian-Affine with LIOP in [8] and in this work, as well as that of the SURF combination, which shows high performance in both spectral bands.

5 Conclusions and Future Directions

We have performed a systematic investigation of the performance of state-of-the-art local feature detectors and descriptors on infrared images, justified by the needs of various vision applications such as image stitching and recognition. In doing so, we have also generated a new IR image data set and made it publicly available. Through the extensive evaluations we have gained useful insight into which local features to use according to the expected transformation properties of the input images as well as the requirements on efficiency. It should be highlighted that the combination of detector and descriptor should be considered, as it can outperform the standard combination. As a consequence of our comparisons at large, Hessian-Affine with LIOP, and the SURF detector with the SURF descriptor, have shown good performance under many of the geometric and photometric transformations. Among binary detectors and descriptors, competitive results are obtained with the combination of ORB and BRISK.

Compared to the most relevant work by Ricaurte et al. [19], this work evaluated performance against view-point changes, the LIOP descriptor, floating point detectors such as Hessian-Affine and Harris-Affine, and different combinations of detectors and descriptors, filling a gap in evaluations for IR images.

In future research we will extend the study from these hand-crafted features to learning based representations such as RFD [25], as well as those [26, 27] obtained by deep convolutional networks, which have been shown to be very effective for a range of visual recognition tasks [28–30]. Fischer et al. [31] demonstrated that such descriptors perform consistently better than SIFT also in the low-level task of descriptor matching. Although the networks are typically trained on the ImageNet data set consisting of visual images, it will be interesting to see whether such a network is applicable to extracting descriptors in IR images (via transfer learning), or whether one would need yet another large data set of IR images to train a deep convolutional network. Nevertheless, the study in this direction is beyond the scope of this paper and is left as the subject of our next comparison.