1 Introduction

Finding local descriptors for image points is an active and challenging field of research in the image processing and computer vision communities. Several works have proposed different types of local descriptors for sparse feature matching or dense matching. For sparse feature matching, notable methods include SIFT [11], SURF [2], BRIEF [4], ORB [17], BRISK [10], LIOP [19] and MRRID [7]. An evaluation of these methods can be found in [1, 12]. According to these works, although SIFT and SURF are relatively old methods, they still outperform many of the newer methods by delivering higher rates of correct matches. Nevertheless, SURF is known to perform poorly under in-plane rotations. We found that this is due to a simple omission in the calculation of the SURF descriptor: the averaged vectors are not projected onto the axes of the oriented coordinate frame.

SIFT descriptors are generated from histograms of oriented gradients (with 8 bins) in 16 windows surrounding a point. In SURF, the sums of the gradient components and the sums of the absolute values of the gradient components in 16 windows about a point form a descriptor vector of length 64. Unfortunately, SIFT and SURF are slow and are not appropriate for real-time applications. Hence, methods based on binary descriptors, such as BRIEF and ORB, were later proposed to meet real-time requirements. The BRIEF descriptor is generated from pairs of test points in a neighborhood of a point. As BRIEF is not rotation invariant, ORB was later proposed as a rotation-invariant version of BRIEF. In the BRISK method, binary patterns are extracted using a regular, radially symmetric sampling pattern about a point, and the intensities in the neighborhood of the point are smoothed with Gaussian kernels of different sizes. These considerations make the precision of BRISK slightly better than that of BRIEF and ORB. Binary descriptors are fast to create, and in the matching process the Hamming distance (an XOR operation followed by bit counting) can be used to speed up matching dramatically.
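The speed advantage of binary descriptors in matching can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each binary descriptor is stored as an integer bit string; the function name `hamming_distance` and the example values are hypothetical and do not come from any of the cited implementations.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors stored as ints."""
    # XOR sets exactly the bits in which the two descriptors disagree;
    # counting those bits gives the Hamming distance.
    return bin(a ^ b).count("1")


# Two hypothetical 8-bit descriptors that differ in two bit positions.
d1 = 0b10110010
d2 = 0b10011010
print(hamming_distance(d1, d2))  # prints 2
```

Real binary descriptors are much longer than 8 bits, but the same two operations apply to each machine word of the descriptor.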

The aforementioned methods work well for sparse matching; however, their good performance is due to long descriptors built from large neighborhoods of the features. When feature descriptors are needed for other purposes such as dense image matching or object and action recognition, such long descriptors impose high computational loads. In the literature, compact descriptors such as census [20], MLDP [14] and HOG [5] have been used for the computation of dense optical flow. HOG is also used in object and human action recognition. Among these descriptors, HOG has shown promising performance because it considers both the magnitudes and the directions of gradients in a local window [16]. Nevertheless, the computation of HOG is expensive as it requires the calculation of gradient angles. Moreover, HOG neglects the spatial distribution of gradients, which degrades its discriminative power. In this paper, we propose a new compact descriptor of length 8, inspired by the SURF descriptor. We name the descriptor distributed averages of gradients (DAG); it outperforms HOG in both computation time and discrimination.

The paper is structured as follows: In Sect. 2, the gradient based descriptors are reviewed. Section 3 describes how DAG is constructed. In Sect. 4, the application of DAG for the computation of dense optical flow and face detection is presented. DAG is evaluated in Sect. 5. Section 6 concludes this paper.

2 Gradient-Based Descriptors

In this section, the SIFT and SURF descriptors are reviewed as they are closely related to HOG and DAG. In the SIFT method, HOG descriptors are calculated in 16 different windows surrounding a feature point. The descriptor is then formed by arranging the bin values in a vector of length 128. The calculation of HOG is based on the local gradients of an image in a window about a point. Given an image \(I(x,y): \varOmega \rightarrow \mathcal R\), the gradient vector for a pixel \((x,y)\) is computed as follows:

$$\begin{aligned} \mathbf v(x,y)=[v_x \ v_y]^T=[I_x(x,y) \ I_y(x,y)]^T \end{aligned}$$
(1)

where \(I_x=\frac{\partial I}{\partial x}\) and \(I_y=\frac{\partial I}{\partial y}\). To form HOG, the magnitudes m and angles \(\alpha \) of gradients should be calculated:

$$\begin{aligned} \alpha&= atan2 (v_y,v_x) \nonumber \\ m&=\sqrt{v^2_x+v^2_y} \end{aligned}$$
(2)

To calculate HOG for a point \((x,y)\), the gradients in a rectangular neighborhood of the point are taken into account. To this end, n bins (\(b_1 \ldots b_n\)) spanning \(0^\circ \ldots \ 360^\circ \) are created. The value of each bin is formed as follows:

$$\begin{aligned} b_i=\sum _{\beta _i<\alpha _j<\beta _{i+1}} m_j \end{aligned}$$
(3)

where \(\beta _i=\frac{360}{n}(i-1)\) and j indexes the gradients in the neighborhood. A popular number of bins is eight, as used in SIFT. HOG has two main shortcomings: first, it requires relatively high computational effort for the calculation of gradient angles; second, it discards the spatial arrangement of the gradients. In SIFT, HOG descriptors are calculated in different sub-windows about a feature, so the geometry of the gradients is still taken into account, albeit in another way.
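As an illustration only, Eqs. 2 and 3 can be sketched in Python as follows. The function name `hog_bins`, the list-of-tuples input and the use of half-open bin intervals (so that an angle lying exactly on a bin boundary is still counted) are assumptions made for this sketch and are not taken from any reference implementation.

```python
import math


def hog_bins(gradients, n_bins=8):
    """Accumulate gradient magnitudes into n_bins orientation bins over [0, 360)."""
    bins = [0.0] * n_bins
    width = 360.0 / n_bins
    for vx, vy in gradients:
        m = math.hypot(vx, vy)                             # magnitude, Eq. 2
        alpha = math.degrees(math.atan2(vy, vx)) % 360.0   # angle mapped to [0, 360)
        # Assumption: bin i covers [beta_i, beta_{i+1}); the min() guards
        # against a rounding result of exactly 360 degrees.
        i = min(int(alpha // width), n_bins - 1)
        bins[i] += m                                       # Eq. 3
    return bins


# Four hypothetical gradients pointing right, down-axis positive, left and negative y.
print(hog_bins([(1.0, 0.0), (0.0, 2.0), (-3.0, 0.0), (0.0, -1.0)]))
```

The call to `atan2` for every gradient is the step that makes HOG comparatively expensive.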

Unlike SIFT, SURF descriptors are generated by summing the gradient components and their absolute values to form a vector of four elements \([\sum v_x, \sum v_y, \sum |v_x|, \sum |v_y|]^T\). Calculating such a vector for each of the 16 windows surrounding a point yields a descriptor vector of size 64.
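The construction of the four-element vector and its concatenation can be sketched as follows. This is a simplified illustration of the description above, not the SURF implementation of [2], which contains further steps that the sketch omits; the function names and the list-based input format are assumptions.

```python
def surf_subvector(gradients):
    """Return [sum vx, sum vy, sum |vx|, sum |vy|] for one sub-window."""
    sx = sum(vx for vx, _ in gradients)
    sy = sum(vy for _, vy in gradients)
    ax = sum(abs(vx) for vx, _ in gradients)
    ay = sum(abs(vy) for _, vy in gradients)
    return [sx, sy, ax, ay]


def surf_like_descriptor(windows):
    """Concatenate the sub-vectors of all sub-windows (16 windows give 64 values)."""
    descriptor = []
    for gradients in windows:
        descriptor.extend(surf_subvector(gradients))
    return descriptor


# One hypothetical sub-window with two gradient vectors.
print(surf_subvector([(1.0, -2.0), (-3.0, 4.0)]))
```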

3 Distributed Averages of Gradients

As mentioned in the introduction, the long descriptors of feature matching methods are key to achieving high rates of correct matches. However, in dense matching based on differential techniques, such long descriptors give rise to very high computational loads. Consequently, compact descriptors, mostly of length eight, have been used for dense optical flow computation. We therefore propose a compact descriptor inspired by the SURF descriptor, in which the averages of gradients in only four windows surrounding each pixel are used (Fig. 1).

Fig. 1. DAG construction. Each arrow indicates the average of gradients in a window.

Unlike SURF, which uses only small sub-windows, the window size in our proposed descriptor can be changed, while the number of windows always remains four. Additionally, the four windows overlap, which makes the descriptor robust against abrupt changes of the gradients at the window borders. Furthermore, to keep the descriptor compact, we do not use the averages of the absolute values of the gradient components. Instead, the average of gradients for each sub-window \(w_i: i=1,2,3,4\) is simply calculated as follows:

$$\begin{aligned} \mathbf v_{i}=[v_{i,x} \ v_{i,y}]^ T=\frac{1}{N}\sum _{(x,y) \in w_i} \mathbf v(x,y) \end{aligned}$$
(4)

where \(N=(S/2+1)^2\) is the number of pixels in each sub-window. Concatenating the four vectors yields the following descriptor vector:

$$\begin{aligned} \mathbf d=[v_{1,x} \ v_{1,y} \ v_{2,x} \ v_{2,y} \ v_{3,x} \ v_{3,y} \ v_{4,x} \ v_{4,y}]^T \end{aligned}$$
(5)

The descriptor vector can also be normalized to achieve robustness against illumination changes. We abbreviate the normalized version as NDAG.
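A minimal sketch of Eqs. 4 and 5 is given below. Several details are assumptions made for illustration: S is taken to be an odd window width, S/2 is evaluated with integer division, the four sub-windows are the four quadrants around the centre pixel (sharing the centre row and column, which is one way to obtain overlapping windows), the quadrants are visited in the order top-left, top-right, bottom-left, bottom-right, and the normalization uses the Euclidean norm. The function name `dag_descriptor` and the nested-list image format are likewise hypothetical.

```python
import math


def dag_descriptor(Ix, Iy, cx, cy, S=7, normalize=False):
    """Eight-element DAG-style descriptor at pixel (cx, cy).

    Ix, Iy are gradient images indexed as [row][column]; the caller must
    ensure that the S x S window around (cx, cy) lies inside the image.
    """
    r = S // 2                # half width; each sub-window has side r + 1
    n = (r + 1) ** 2          # N = (S/2 + 1)^2 with integer division (assumption)
    d = []
    # (sy, sx) selects one quadrant; all four share the centre row and column.
    for sy, sx in ((-1, -1), (-1, 1), (1, -1), (1, 1)):
        sum_x = sum_y = 0.0
        for h in range(r + 1):
            for w in range(r + 1):
                y, x = cy + sy * h, cx + sx * w
                sum_x += Ix[y][x]
                sum_y += Iy[y][x]
        d += [sum_x / n, sum_y / n]          # Eq. 4 for this sub-window
    if normalize:                            # NDAG, assuming the Euclidean norm
        norm = math.sqrt(sum(v * v for v in d))
        if norm > 0.0:
            d = [v / norm for v in d]
    return d                                 # Eq. 5


# Hypothetical 7 x 7 gradient images: Ix grows with x, Iy grows with y.
Ix = [[float(x) for x in range(7)] for _ in range(7)]
Iy = [[float(y) for _ in range(7)] for y in range(7)]
print(dag_descriptor(Ix, Iy, 3, 3, S=7))
```

No angle computation or binning is involved, only sums and a division per sub-window.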

3.1 Rotation Invariant DAG

To make DAG rotation invariant, a normalized vector in the direction of the average of gradients in a neighborhood of each point is used: \(\mathbf g=[g_x \ g_y]^ T\). We also use the vector orthogonal to \(\mathbf g\), namely \(\mathbf k\), to form a local coordinate system based on \(\mathbf g\) and \(\mathbf k\). The four windows about the keypoint are scanned using four pairs of vectors:

$$\begin{aligned} \{\mathbf g_1= -\mathbf g, \ \mathbf k_1= \mathbf k\}, \ \{\mathbf g_2=\mathbf g, \ \mathbf k_2= \mathbf k \}, \ \{\mathbf g_3=-\mathbf g, \ \mathbf k_3=- \mathbf k \}, \ \{\mathbf g_4=\mathbf g, \ \mathbf k_4=- \mathbf k \} \end{aligned}$$
(6)

To address pixels in surrounding windows, we simply use the following equation:

$$\begin{aligned} \mathbf x=[x \ y]^T= h \mathbf g_i +w \mathbf k_i \end{aligned}$$
(7)

where i is the index of a window, \(h=0,\ldots ,\frac{S}{2}+1\) and \(w=0,\ldots ,\frac{S}{2}+1\). After the averages of gradients in the four windows have been calculated, it is important to project them onto the axes of the rotated coordinate system (\(\mathbf g\) and \(\mathbf k\)). This step is omitted in the SURF method, which gives rise to its poor performance under in-plane rotations.
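The scanning and projection steps can be sketched as follows. The sketch rests on several assumptions: the four windows are taken to correspond to the four sign combinations of \(\mathbf g\) and \(\mathbf k\), the offsets of Eq. 7 are added to the keypoint position, each window has S//2 + 1 samples per axis, and the gradient field is given as a hypothetical callable `grad(x, y)` so that no interpolation is needed. The function name `rotated_dag` is also an assumption.

```python
import math


def rotated_dag(grad, cx, cy, g, S=7):
    """DAG-style descriptor in the local frame (g, k) around (cx, cy).

    grad(x, y) returns the gradient (vx, vy) at a real-valued position;
    g is a unit vector.
    """
    gx, gy = g
    kx, ky = -gy, gx                     # k is orthogonal to g
    r = S // 2
    n = (r + 1) ** 2
    d = []
    # Sign pairs (sign of g, sign of k) for the four windows (assumption).
    for sg, sk in ((-1, 1), (1, 1), (-1, -1), (1, -1)):
        sum_x = sum_y = 0.0
        for h in range(r + 1):
            for w in range(r + 1):
                # Eq. 7, taken relative to the keypoint position.
                x = cx + sg * h * gx + sk * w * kx
                y = cy + sg * h * gy + sk * w * ky
                vx, vy = grad(x, y)
                sum_x += vx
                sum_y += vy
        ax, ay = sum_x / n, sum_y / n
        # Project the window average onto the local axes g and k.
        d += [ax * gx + ay * gy, ax * kx + ay * ky]
    return d


# Hypothetical analytic gradient field, evaluated with g along the x axis.
field = lambda x, y: (x * x + y, x - 2.0 * y)
print(rotated_dag(field, 0.0, 0.0, (1.0, 0.0)))
```

With a discrete gradient image, the sample positions would have to be rounded or interpolated, which this sketch does not cover. Without the final projection, rotating the image would rotate the averaged vectors and change the descriptor.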

4 Applications of DAG

In this section two applications of DAG are proposed: first for the computation of dense optical flow, and second for face detection.

4.1 Dense Optical Flow Using DAG

We apply DAG to the computation of optical flow. To this end, we use the combined local-global method proposed in [16]. Given two images I and \(I'\), DAG descriptors are extracted for all pixels in both images. As a result, for each image, an 8-channel image is created to store the eight components of the DAG descriptor. We name the 8-channel images S and \(S'\), associated with the images I and \(I'\) respectively. We then define a cost function consisting of data and regularization terms as explained in [16]:

$$\begin{aligned} \underset{u,v}{{\text {min}}}\ \ E(u,v) = \sum _{\varOmega } \left( \lambda E_{data} + \gamma E_{smooth} + E_{dual}\right) , \end{aligned}$$
(8)

where

$$\begin{aligned} E_{data}=\rho {(x,y,u,v)} \end{aligned}$$
(9)
$$\begin{aligned} E_{smooth}=\,\parallel \nabla u \parallel + \parallel \nabla v \parallel \end{aligned}$$
(10)
$$\begin{aligned} E_{dual}=\frac{1}{2\theta }(u-\hat{u})^2 \end{aligned}$$
(11)

where \(\lambda \), \(\gamma \) and \(\theta \) are the importance weights for each term. \(\rho (.)\) is a similarity function as follows:

$$\begin{aligned} \rho (x,y, u,v) = \sum _{i=1}^{8} \left( S_i'(x + u, y + v) - S_i(x,y)\right) ^2 \end{aligned}$$
(12)

where i is the channel index. To calculate optical flows, the cost function in Eq. 8 can be minimized based on the dual variable technique explained in [16].
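For illustration, the data term of Eq. 12 can be evaluated at a single pixel as in the following sketch. It assumes integer displacements (u, v) so that no interpolation or warping is required, and it stores each 8-channel image as a list of channels indexed as [row][column]; the function name `data_cost` and the toy data are hypothetical.

```python
def data_cost(S1, S2, x, y, u, v):
    """Sum of squared channel differences between S1(x, y) and S2(x + u, y + v)."""
    return sum((c2[y + v][x + u] - c1[y][x]) ** 2 for c1, c2 in zip(S1, S2))


# Toy 8-channel images of size 3 x 3; S2 is S1 shifted one pixel to the right.
S1 = [[[i + 10 * yy + xx for xx in range(3)] for yy in range(3)] for i in range(8)]
S2 = [[[i + 10 * yy + (xx - 1) for xx in range(3)] for yy in range(3)] for i in range(8)]
print(data_cost(S1, S2, 1, 1, 1, 0))  # cost at the true displacement
print(data_cost(S1, S2, 1, 1, 0, 0))  # cost at zero displacement
```

In this toy example the cost vanishes at the true displacement and is positive elsewhere, which is the property the minimization of Eq. 8 exploits.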

4.2 DAG for Face Detection

One of the main tasks of a rescue robot is to explore a disaster environment to find living survivors. To do so, the robot should be able to distinguish between victims and other objects. The human face is an important feature by which victims can be identified more reliably than by other features. We have implemented a face detection framework that can work with different types of descriptors. In this framework, a descriptor is extracted for a region of interest, and an SVM (support vector machine) classifier is trained for two classes, face and non-face. To achieve better detection rates, the framework uses a multi-camera system consisting of thermal and video cameras. The thermal images are segmented based on a temperature threshold to provide a list of regions that may contain faces. We align the thermal images with the video camera images using a simple calibration process. Thus, for each region in the thermal image, a corresponding region in the video image can be found (see Fig. 2). For each region in an image from the video camera, a descriptor based on DAG is generated. To this end, each region is divided into regular blocks of size \(N \times N\), and a DAG descriptor (or another descriptor) is generated for each block. The block descriptors are then concatenated to form a global descriptor for the region. The extracted feature vector contains fine-detailed information about the image such as edges, spots, corners and other local texture features.

Fig. 2. Face detection using thermal and video images.

5 Evaluation and Experimental Results

5.1 Sparse Matching

To evaluate the discriminative ability of DAG, we used the KITTI training dataset [8] for optical flow. This dataset includes 194 pairs of images and provides the ground-truth flow between the two images of each pair. We compared DAG, normalized DAG (NDAG), HOG, normalized HOG (NHOG) and ORB as a baseline method. We selected ORB because we wanted to evaluate the performance of the different methods given the same inputs. To this end, we extracted Shi-Tomasi features [18] at eight pyramid levels with a scale factor of 0.8 and computed descriptors for the features with each method. The average number of extracted features was 3953. We used two measures for the evaluation:

$$\begin{aligned} \text {precision}&=\frac{\text {number of correct matches}}{\text {number of matches}} \\ \text {recall}&= \frac{\text {number of correct matches}}{\text {number of all features to be matched}} \end{aligned}$$

To have a uniform matching strategy, we used the 2-nearest-neighbor (2-nn) matching method. In this method, for each feature in the first image, namely \(\mathbf f_1\) with the descriptor \(\mathbf d_1\), the two closest features in the second image, namely \(\mathbf f_{2,1}\) and \(\mathbf f_{2,2}\), with the descriptors \(\mathbf d_{2,1}\) and \(\mathbf d_{2,2}\), are found. If the Euclidean distances of the descriptors are \(d_1=||\mathbf d_1-\mathbf d_{2,1}||\) and \(d_2=||\mathbf d_1-\mathbf d_{2,2}||\), the match between \(\mathbf f_1\) and \(\mathbf f_{2,1}\) is accepted if \(d_1/d_2<0.8\); otherwise, no match is assigned to \(\mathbf f_1\).
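The matching rule and the two evaluation measures can be sketched as follows. The brute-force search over all candidates, the function names and the handling of degenerate cases (fewer than two candidates, or a second-nearest distance of zero) are assumptions of this sketch; the experiments may have used a different search structure.

```python
import math


def ratio_test_match(d1, candidates, ratio=0.8):
    """Index of the nearest candidate if it passes the 2-nn ratio test, else None."""
    if len(candidates) < 2:
        return None
    # Sort candidates by Euclidean distance to d1 and keep the two nearest.
    dists = sorted((math.dist(d1, c), i) for i, c in enumerate(candidates))
    (best, best_idx), (second, _) = dists[0], dists[1]
    if second > 0.0 and best / second < ratio:
        return best_idx
    return None


def precision_recall(n_correct, n_matches, n_features):
    """Precision and recall as defined in the two measures above."""
    precision = n_correct / n_matches if n_matches else 0.0
    recall = n_correct / n_features if n_features else 0.0
    return precision, recall


# Hypothetical 2-D descriptors: the nearest candidate is clearly better.
print(ratio_test_match([0.0, 0.0], [[1.0, 0.0], [5.0, 0.0], [0.5, 0.0]]))
# Two almost equally good candidates: the match is rejected.
print(ratio_test_match([0.0, 0.0], [[1.0, 0.0], [1.1, 0.0]]))
```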

Fig. 3. Precision and recall curves for high quality features (top) and low quality features (bottom).

We conducted two experiments to evaluate the performance of the upright and rotated descriptors. In the first experiment, we evaluated the upright versions of all methods for two different groups of features: first, features with qualities above 0.005, and second, features with qualities between 0.001 and 0.005. The aim is to see how the methods work for features of good and bad quality. This is important for dense matching, as most points typically have low gradient responses. Figure 3 shows the precision and recall curves of the different methods with respect to the window width. For small window sizes NDAG performs much better than the other methods, but for larger windows ORB performs better. A comparison of DAG and HOG shows that the precision and recall of DAG keep increasing as the window size increases, whereas the performance of HOG degrades beyond a window size of 21. This indicates that DAG captures local information better than HOG at larger window sizes. NDAG works better than DAG because, in small windows, illumination changes adversely affect the average of gradients. As normalization makes DAG robust against illumination changes, the better performance of NDAG is reasonable. HOG, on the other hand, is already robust against illumination changes thanks to its binning process, so its normalization only leads to a loss of information.

For low quality features, DAG outperforms all other methods. The poor performance of ORB is explained by the fact that most low quality features are located in almost homogeneous regions such as roads, where binary descriptors are more vulnerable to measurement noise than gradient-based methods.

In the second experiment, we rotated the second image of each image pair to evaluate the rotation invariance of DAG. In this experiment, we also ran the SIFT method, since its behavior on these sequences may be of interest to readers. Unfortunately, SURF performed very poorly and was not comparable with any of the other methods; therefore, we did not include it in this experiment. Figure 4 depicts the comparison results. All methods except SIFT experience relatively large drops at the angles \(45^\circ k\) (\(k=1,\ldots ,7\)). Nevertheless, ORB and then DAG have the best performance. Surprisingly, SIFT performs relatively poorly. We investigated the problem and found that it originates from the nature of the outdoor images in the KITTI dataset. In these images, the repeatability of blob features is low, so many candidates for correct matches are missed.

Fig. 4. Precision and recall curves in case of in-plane rotations.

5.2 Computation of Dense Optical Flow and 3D Scene Reconstruction

We applied the DAG descriptor to compute dense optical flow for the KITTI [8] dataset as explained in Sect. 4.1. The KITTI dataset provides a very challenging testbed for the evaluation of optical flow algorithms. Pixel displacements in the dataset are generally large, and the images exhibit low-texture regions, strongly varying lighting conditions, and many non-Lambertian surfaces, especially translucent windows, specular glass, and metal surfaces. Moreover, the high speed of the forward motion creates large regions at the image boundaries that move out of the field of view between frames, such that no correspondence can be established.

To evaluate the estimated optical flow, we calculated the average end-point error (AEE), the average angular error (AAE), and the percentage of pixels with an end-point error of more than 3 pixels, also known as bad pixels [8]. In all experiments, we used the fixed-point iteration algorithm to optimize the objective function in Eq. 8. To deal with large displacements, we used the coarse-to-fine technique [3]. Furthermore, we used the normalized versions of HOG and DAG, because the normalized descriptors are robust against illumination changes and because, thanks to the normalization, a constant smoothness parameter can be applied regardless of the magnitudes of the gradients. We used the following parameter setup for all descriptors: pyramid scale factor 0.9, smoothness parameter \(\lambda = 10\), 4 outer fixed-point iterations (image warps), and 10 inner fixed-point iterations. For DAG and HOG, we used window sizes of 7 and 5 respectively, which yielded the best results.
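The three error measures can be sketched as follows. The text does not spell out the formulas, so the sketch assumes the definitions commonly used in the optical flow literature: the end-point error is the Euclidean distance between estimated and ground-truth flow vectors, and the angular error is the angle between the space-time vectors (u, v, 1). The function name `flow_errors` and the flat list-of-pairs input are hypothetical, and pixels without valid ground truth are assumed to have been removed beforehand.

```python
import math


def flow_errors(flow, gt, bad_threshold=3.0):
    """Return (AEE, AAE in degrees, percentage of bad pixels)."""
    epe, ang = [], []
    for (u, v), (ug, vg) in zip(flow, gt):
        epe.append(math.hypot(u - ug, v - vg))        # end-point error
        num = u * ug + v * vg + 1.0
        den = math.sqrt(u * u + v * v + 1.0) * math.sqrt(ug * ug + vg * vg + 1.0)
        cos_angle = max(-1.0, min(1.0, num / den))    # clamp against rounding
        ang.append(math.degrees(math.acos(cos_angle)))
    n = len(epe)
    aee = sum(epe) / n
    aae = sum(ang) / n
    bad = 100.0 * sum(e > bad_threshold for e in epe) / n
    return aee, aae, bad


# Three hypothetical pixels; only the second one is estimated incorrectly.
est = [(1.0, 0.0), (0.0, 0.0), (4.0, 3.0)]
ref = [(1.0, 0.0), (3.0, 4.0), (4.0, 3.0)]
print(flow_errors(est, ref))
```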

For the KITTI training dataset, Table 1 shows the average AEE and the average percentage of bad pixels over the 194 sequences. As shown in Table 1, DAG outperforms HOG in terms of AEE, but the percentage of bad pixels for HOG is slightly better than for DAG. Furthermore, Table 2 shows a comparison of DAG, HOG, and MLDP for some sequences of the KITTI training dataset. As shown in this table, DAG outperforms the other descriptors. In Fig. 5, the flows computed for sequence 0 using the different methods are visualized. DAG works very well in estimating flow in low-texture regions, which supports the previous results for sparse feature matching.

One of the main applications of dense optical flow is 3D scene reconstruction based on structure-from-motion techniques. These techniques are especially important for outdoor environments, where RGB-D cameras cannot be used due to their limitations and 3D range sensors may not be an option due to their high prices. In Fig. 5, the reconstruction of the scene for sequence 0 of KITTI is visualized. We used the techniques proposed in [13] for the camera motion estimation and the 3D scene reconstruction. The structure of the scene is reconstructed well in many places. In particular, the ground plane is rebuilt very well, showing the ability of the combined techniques to provide fine information for terrain analysis.

We have also computed the optical flow based on the DAG descriptor for the KITTI test dataset and submitted the results to the KITTI website under the name DAG-Flow. Table 3 shows a comparison of algorithms that use local descriptors for optical flow estimation.

Table 1. Average of the AEE and the average of the percentage of outliers for the KITTI training dataset.
Table 2. The AEE and the percentage of bad pixels for some sequences of KITTI training dataset.
Table 3. KITTI test dataset. The average of the AEE and the average of the percentage of outliers for some methods using local descriptors for the optical flow estimation.
Fig. 5. KITTI training data set. Row 1: sequence 000000_10 and frame 000000_11. Row 2: color coding of the estimated optical flow and the error map using DAG. Row 3: estimated optical flow and error map using HOG [16] descriptor. Row 4: reconstructed scene from two views. (Color figure online)

We also measured the average elapsed time for computing the optical flow for each pair of images. The optimization algorithm was implemented in C++ and ran on an Intel Core2Duo processor at 2.5 GHz. The computation time was 12.10 s for DAG, compared with 22.70 s for HOG and 16.26 s for MLDP.

5.3 Face Detection

To evaluate the performance of the DAG descriptor for face detection, we used the frontal face dataset from the Caltech 101 categories [9]. The Caltech dataset contains images of 101 object categories. The face dataset contains 450 images with a resolution of \(896 \times 592\) pixels. The images were captured from 27 people under different lighting conditions, with various facial expressions and different backgrounds. In our setup, every image is partitioned into \(15 \times 15\) blocks. For training and testing the detection algorithm, we used images from the frontal face dataset as positive images and images of other objects as negative images. For training the SVM classifier, we used 100 positive images and 100 negative images. We tested the algorithm using 450 positive images and 450 negative images. We evaluated the accuracy of the algorithm by calculating the false positive and false negative ratios. Interestingly, the face detection algorithm based on DAG detected all 450 faces correctly and produced no false detections, whereas the algorithm based on HOG failed to detect some faces (false negatives). Figure 6 presents some samples of faces that the SVM classifier based on HOG failed to detect.

Fig. 6. Samples of false detections of the SVM classifier based on the HOG feature vector. Left: frame 88; middle: frame 193; right: frame 338.

6 Conclusion

In this paper, a new compact descriptor for dense image matching and object recognition was proposed. The descriptor is calculated from the local gradients of an image by averaging the gradients in four different regions surrounding a center point. The descriptor can be calculated much faster than the histogram of oriented gradients, as there is no need to calculate the angle of each gradient and the binning step is skipped. Additionally, two different types of experiments conducted in this paper showed that the proposed descriptor is more discriminative than HOG.