1 Introduction

Vehicle verification is an important task for access control in public traffic security. In vehicle access control, the license plate number is widely used as a reliable and unique ID of a vehicle. However, in real-world applications, license plates may be faked, occluded, missing, or unrecognizable. In all these cases, the license plate cannot be relied on, so there has been some research on image-based vehicle verification. Most conventional image verification methods compare holistic image features. These methods have two drawbacks when dealing with vehicle images.

Firstly, in access control applications, the images of vehicles with the same attributes (such as type, maker, and color) and images shot from nearby viewpoints are very similar to each other (Fig. 1). It is hard to distinguish such similar images using holistic image features.

Secondly, these methods cannot interpret the verification (i.e., they do not provide any evidence for the verification result), while interpretability is desirable in vehicle access control applications.

In order to solve these problems, we propose a method for interpretable vehicle verification. Our goal is not only to verify whether two images are identical, but also to provide evidence for the verification. We focus on vehicle access monitoring applications where the images are typically shot from frontal views. We notice that local details such as customized paintings, decorations, and inspection marks reveal fine-scale differences between vehicle images (Fig. 1 shows some examples). We therefore propose to automatically discover these fine-grained differences. Based on them, we can tell where the differences are located and how much the images differ from each other.

Fig. 1. Five pairs of vehicle images. Each pair consists of two similar images with the same attributes, including maker, model, year, and color. The two images in each of the first three pairs are from different vehicles, while the images in (d) and (e) are from identical vehicles. We labeled the local differences in the first three image pairs with green boxes. Note that the images in the first three pairs are very similar to each other and can only be distinguished by these local fine-grained differences. (Color figure online)

We develop a multi-task deep convolutional encoder-decoder network for this problem. The encoder is a Siamese network [4] composed of two identical convolutional neural networks (CNNs) [3]; we use it to classify two images as identical or not. The decoder is a deconvolutional network [6]; given a pair of vehicle images, we use it to predict a saliency map [11] that indicates the significance of their difference at every pixel. Our network design enables end-to-end training. We validate our method on real-world vehicle images and show that our approach achieves much better performance than other methods.

The main innovation of our method is to combine the verification task with a saliency-map regression task. In other words, our approach not only predicts whether two images are identical, but also provides evidence for its predictions. This is achieved by predicting a score and a saliency map, where the score indicates the dissimilarity of the two images and the saliency map indicates the subtle regions that differentiate them.

The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 introduces our algorithm in detail. In Sect. 4, we present various experiments. Finally, we conclude the paper in Sect. 5.

2 Related Work

Image verification has been studied extensively in recent years. The typical approaches either classify whether two images are of the same ID or predict a similarity score by comparing their features [15]. Face verification [14] and signature verification [7] are the most successful endeavors. For example, the common pipeline of face verification consists of four stages: face detection [8], face alignment, feature representation, and classification. In the feature representation stage, the feature extractor is designed to capture key features, such as the structure of the eyes and nose, while ignoring other elements. In the vehicle verification task, however, there are no such key features, because every small image patch may be crucial. For example, two vehicle images of the same class that differ only in the presence of an inspection mark should be distinguished. Therefore, the above methods cannot perform well on such fine-grained verification tasks.

Fig. 2. Image alignment via feature matching.

Therefore, several improved methods have been proposed and have achieved good results. An intuitive improvement is to combine classification and similarity constraints into a CNN-based multi-task learning framework [1]. For example, [13, 17] combine a softmax loss with a triplet loss to reach better performance. These methods improve upon conventional verification because the similarity constraints provide additional information for training the network. However, they still have limitations. Similarity constraints represent each image as a holistic feature vector; although we can verify whether samples are identical by comparing such feature vectors, these holistic vectors can hardly reveal the subtle yet significant regions in which the images differ. As shown in Fig. 1, different images can look almost identical except in some very small regions. In vehicle image verification, it is desirable to spot exactly the regions where two vehicles differ, so that we can interpret the verification results.

3 Approach

3.1 Image Alignment

Due to differences in viewpoint, misalignment between vehicle images hinders the precise localization of the small differing regions. Therefore, the first step of our approach is to align the two vehicle images. Note that we focus on frontal-view truck image verification, which is ubiquitous in real-life transportation systems. The front facade of a truck can be roughly modelled as a plane, so images taken from different viewpoints are related by a perspective transformation. We first match the two images via SIFT [5] features and then estimate the transformation between them using the RANSAC [2] algorithm. The two original images are transformed and cropped according to their overlapping region. Figure 2 illustrates the alignment process.
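A minimal sketch of this alignment step with OpenCV is given below; the ratio-test and RANSAC thresholds shown are common defaults rather than the exact settings used in our experiments, and cropping to the overlapping region is omitted for brevity.

```python
# Alignment sketch: SIFT matching + RANSAC homography estimation (illustrative thresholds).
import cv2
import numpy as np

def align_pair(img1, img2):
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY), None)
    k2, d2 = sift.detectAndCompute(cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY), None)

    # Match SIFT descriptors and keep matches that pass Lowe's ratio test.
    matches = cv2.BFMatcher().knnMatch(d1, d2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    # Estimate the perspective (homography) transform with RANSAC.
    src = np.float32([k1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    # Warp the first image into the second image's frame (cropping omitted).
    h, w = img2.shape[:2]
    return cv2.warpPerspective(img1, H, (w, h)), img2
```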

Fig. 3. Deep convolutional encoder-decoder network for vehicle image verification.

3.2 Network Architecture

Let \({X_1,X_2}\) be two aligned images of the same size \({H\times W}\). Our approach predicts a label \(y\in \{0,1\}\) and a saliency map \(S\in (0,1)^{H\times W}\) for \(X_1\) and \(X_2\):

$$\begin{aligned} (y,S) = f(X_1,X_2;\mathbf{\theta }) \end{aligned}$$
(1)

where y indicates whether \(X_1\) and \(X_2\) are identical (\(y=0\)) or not (\(y=1\)), \(S(i,j)\in (0,1)\) indicates the degree to which \({X_1(i,j)}\) differs from \({X_2(i,j)}\), and \(\varvec{\theta }\) denotes the parameters of our model.

We represent f (Eq. (1)) as a deep convolutional encoder-decoder network [16] (Fig. 3). The encoder is a Siamese architecture [4] that extracts features from the two images and verifies them (predicting y). The decoder is a deconvolutional network [6] that regresses a saliency map from these features. The two branches of the encoder are modified from VGG16 [12], which consists of thirteen convolutional layers, five pooling layers, and three fully connected layers. The dimension of the last fully connected layer usually corresponds to the number of classes in a classification task, which is unnecessary here. Moreover, the fully connected layers map the feature maps produced by the convolutional layers into fixed-length vectors, and the dimension of the last fully connected layer of VGG16 is lower than that of the first two, which causes a loss of image information; removing it also reduces parameter redundancy. Therefore, we remove the last fully connected layer and use the first two FC layers as the feature extractor for label prediction. The resulting convolutional network is composed of thirteen convolutional layers, five pooling layers, and two fully connected layers.

Vehicle Verification. Let \(Z_1\) and \(Z_2\) be the feature vectors of \(X_1\) and \(X_2\), respectively, extracted from the output of the 2nd FC layer of the encoder. We predict the label y for \(X_1\) and \(X_2\) using logistic regression:

$$\begin{aligned} Pr(y=1|X_1,X_2)=\frac{1}{1+e^{-(W_{y}^{T}\left| Z_1-Z_2\right| +b_{y})}} \end{aligned}$$
(2)

where \(\left| Z_1-Z_2\right| \) is a vector of the point-wise absolute difference of \(Z_1\) and \(Z_2\). \(W_y\) and \(b_y\) are parameters.

Saliency Prediction. Let \(V_1\) and \(V_2\) be the feature maps of \(X_1\) and \(X_2\) output by the POOL-5 layer of the convolutional network. We concatenate \(V_1\) and \(V_2\) into V and regress a saliency map S from V using a deconvolutional network. The deconvnet is designed by reversing the VGG16 network and takes the concatenated feature maps V as input. The output of its last layer, F, has size \(224\times 224\). We apply a \(1\times 1\) convolution to this feature map, followed by a sigmoid activation, to generate the saliency map S:

$$\begin{aligned} S(i,j)=\frac{1}{1+e^{-(W_{S}^{T}F(i,j)+b_{S})}} \end{aligned}$$
(3)

where F(i, j) is the feature vector at pixel (i, j) of the feature map F, and \(W_S\) and \(b_S\) are the parameters of the \(1\times 1\) convolution kernel.
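Putting the above together, the network can be sketched compactly in PyTorch as follows. The decoder stage sizes are illustrative (our deconvnet reverses VGG16), and the two heads correspond to Eqs. (2) and (3).

```python
# PyTorch sketch of the encoder-decoder (decoder layer sizes are illustrative).
import torch
import torch.nn as nn
from torchvision import models

class VerificationNet(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=None)
        self.conv = vgg.features                              # 13 conv + 5 pooling layers (shared Siamese branch)
        self.fc = nn.Sequential(*list(vgg.classifier)[:-1])   # keep only the first two FC layers
        self.verify_head = nn.Linear(4096, 1)                 # W_y, b_y in Eq. (2)

        def up(cin, cout):                                    # one deconvolution stage
            return nn.Sequential(nn.ConvTranspose2d(cin, cout, 2, stride=2),
                                 nn.Conv2d(cout, cout, 3, padding=1),
                                 nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(up(1024, 512), up(512, 256), up(256, 128),
                                     up(128, 64), up(64, 64))  # 7x7 -> 224x224
        self.saliency_head = nn.Conv2d(64, 1, 1)              # W_S, b_S in Eq. (3)

    def forward(self, x1, x2):
        v1, v2 = self.conv(x1), self.conv(x2)                 # POOL-5 feature maps V_1, V_2
        z1 = self.fc(torch.flatten(v1, 1))                    # feature vectors Z_1, Z_2
        z2 = self.fc(torch.flatten(v2, 1))
        rho = torch.sigmoid(self.verify_head(torch.abs(z1 - z2)))                    # Eq. (2)
        s = torch.sigmoid(self.saliency_head(self.decoder(torch.cat([v1, v2], 1))))  # Eq. (3)
        return rho.squeeze(1), s.squeeze(1)
```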

3.3 Network Training

Given a labeled training set \(\{ X_{1}^{i},X_{2}^{i},y^{i},M^{i} \}_{i=1}^{N}\), where \(M^{i}\in {\{ 0,1\}}^{H\times W}\) is the ground-truth saliency map for \({X_{1}^{i}}\) and \({X_{2}^{i}}\), the parameters \(\mathbf \theta \) are estimated by minimizing the following multi-task loss:

$$\begin{aligned} l=l_{verify} + l_{saliency} \end{aligned}$$
(4)

where \(l_{verify}\) is the verification loss, which is typically defined as log-loss:

$$\begin{aligned} l_{verify}=\frac{1}{N}\sum _{k=1}^{N}\left[ -y^{k}\log ({\rho }^{k})-(1-y^{k})\log (1-{\rho }^k)\right] \end{aligned}$$
(5)

where \({\rho }^k=Pr(y^k=1|X_{1}^{k},X_{2}^{k})\) is the output of the logistic regressor (Eq. (2)).

\(l_{saliency}\) represents the saliency regression loss in Eq. (4), defined as the squared error between the ground-truth map and the predicted map, averaged over the training samples:

$$\begin{aligned} l_{saliency}=\frac{1}{N}\sum _{k=1}^{N}\sum _{i=1}^{H}\sum _{j=1}^{W}{(M^{k}(i,j)-S^{k}(i,j))^{2}} \end{aligned}$$
(6)

where \(S^{k}\) is the predicted saliency map for \({X_{1}^{k}}\) and \({X_{2}^{k}}\).
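The multi-task loss of Eqs. (4)-(6) can be transcribed directly; the sketch below assumes \(\rho \) and S are the outputs of the network sketched in Sect. 3.2 and that the ground-truth maps M are rescaled to [0, 1].

```python
# Multi-task loss of Eq. (4): log-loss for verification plus squared error for saliency.
import torch

def multitask_loss(rho, y, S, M, eps=1e-7):
    """rho: (N,) predicted Pr(y=1); y: (N,) labels; S, M: (N, H, W) maps in [0, 1]."""
    l_verify = -(y * torch.log(rho + eps)
                 + (1 - y) * torch.log(1 - rho + eps)).mean()    # Eq. (5)
    l_saliency = ((M - S) ** 2).sum(dim=(1, 2)).mean()           # Eq. (6)
    return l_verify + l_saliency                                 # Eq. (4)
```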

We use the stochastic gradient descent algorithm to train our network. The hyper-parameters for training are almost the same as for VGG16 [12], except for the initial global learning rate, which is 0.01 in our experiments, and we reduce the learning rate by a factor of 0.1 after 1000 iterations.
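The corresponding optimizer setup is sketched below; the momentum and weight decay values are the VGG16 defaults and are shown only for illustration.

```python
# SGD with the learning-rate schedule described above (momentum 0.9 and weight
# decay 5e-4 follow the VGG16 defaults; they are illustrative here).
import torch

model = VerificationNet()                      # from the sketch in Sect. 3.2
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
# Multiply the learning rate by 0.1 after 1000 iterations
# (scheduler.step() called once per iteration).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.1)
```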

3.4 Improved Loss Function

We analyzed the verification results on the training set and found that the average saliency of most false positives is lower than that of the true positives; their distributions are shown in Fig. 4. This is because the dissimilarity of the false positive samples is lower than that of the true positive samples, which means that the verification model struggles to distinguish highly similar vehicle image pairs (i.e., it outputs a lower \({\rho }\)). This is consistent with our intuition: two images with fewer differing regions (low saliency values) are harder to discriminate. In order to reduce the false positive rate of the verification, we modify the verification loss as:

$$\begin{aligned} l'_{verify}=\frac{1}{N}\sum _{k=1}^{N}\left[ -\frac{1}{\mu _S^{k}}y^{k}\log ({\rho }^{k})-(1-y^{k})\log (1-{\rho }^k)\right] \end{aligned}$$
(7)

where \({\rho }^k=Pr(y^k=1|X_{1}^{k},X_{2}^{k})\) is the output of the logistic regressor (Eq. (2)), and \(\mu _{S}^{k}\in (0,1)\) is the average saliency of the ground-truth saliency map \(M^{k}\).

Fig. 4. The distribution of average saliency.

In this new loss function, we add a penalty factor, \(\frac{1}{\mu _S^{k}}\), to the first term of the verification loss. The reason is that when the average saliency \({\mu _S^{k}}\) is small, the logistic regressor tends to output a small \({\rho }\), which leads to a false positive, so we apply a larger penalty factor to \(-\log ({\rho })\). When \({\mu _S^{k}}\) is large, the output of the regressor tends to be large, resulting in a true positive, so the penalty factor applied to \(-\log ({\rho })\) is smaller.
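The corresponding change to the verification loss is a one-line reweighting (a sketch following Eq. (7); identical pairs are unaffected because their first term is multiplied by \(y^k=0\)).

```python
# Improved verification loss of Eq. (7): the positive-term log-loss is weighted
# by 1 / mu_S, the average ground-truth saliency of each pair.
import torch

def improved_verify_loss(rho, y, M, eps=1e-7):
    """rho: (N,) predicted Pr(y=1); y: (N,) labels; M: (N, H, W) maps in [0, 1]."""
    mu = M.mean(dim=(1, 2)).clamp(min=eps)     # average saliency mu_S^k of each pair
    return (-(y / mu) * torch.log(rho + eps)
            - (1 - y) * torch.log(1 - rho + eps)).mean()
```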

We compare the output probability \({\rho }\) of the logistic regressor under the two loss functions (5) and (7). We find that the probabilities \({\rho }\) trained with the modified loss (7) are larger than those trained with (5). With the new model, the verification accuracy reaches \( 88.7\%\), and the false positive rate drops to 0.025 (as shown in Table 1).

In addition, we analyze the relationship between the label prediction and the average saliency by calculating their correlation coefficient:

$$\begin{aligned} C=\frac{\sum _{i=1}^{N}(\mu _S^i-\bar{\mu }_S)(\rho ^i-\bar{\rho })}{\sqrt{\sum _{i=1}^{N}(\mu _S^i-\bar{\mu }_S)^2}\sqrt{\sum _{i=1}^{N}(\rho ^i-\bar{\rho })^2}} \end{aligned}$$
(8)

where \(\mu _S^i\) is the average saliency of the i-th saliency map, \(\bar{\mu }_S\) is the mean of the average saliency over all samples, \(\rho ^i\) is the output of the logistic regressor, and \(\bar{\rho }\) is its mean over all samples.

The correlation coefficient is 0.8034 when trained with loss (5) and increases to 0.8407 when trained with loss (7). This means that with our modified loss function, the output probability of the logistic regressor is more strongly correlated with the saliency of the differences between the input images.
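For reference, Eq. (8) is the Pearson correlation coefficient and can be computed directly with NumPy:

```python
# Pearson correlation between average saliency and predicted probability, as in Eq. (8).
import numpy as np

def saliency_probability_correlation(mu_s, rho):
    """mu_s, rho: 1-D arrays over all samples."""
    return np.corrcoef(mu_s, rho)[0, 1]
```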

4 Experiments and Analysis

4.1 Dataset

In order to validate our method on verifying very similar vehicle images, we labeled 5K pairs of frontal-view truck images. The trucks in each image pair share the same maker, type, and color. The label of each pair is determined according to the license plate numbers. We randomly select 4K pairs as training data, of which 2.5K pairs are positive samples (pairs of different vehicles) and the other pairs are negative samples (pairs of identical vehicles). We use the remaining 1000 image pairs as test data.

Fig. 5. Samples of qualitative results. (a) Vehicle pairs with local differences. (b) Identical vehicle pairs.

For each image pair, we labeled every differing part of the two images with a rectangle, except for the crew in the cockpits. The ground-truth labels are converted to a binary saliency map of the same size as the images, where every pixel inside the rectangles is set to 255 and all others are set to 0. Some of the data and their ground-truth saliency maps are shown in Figs. 5 and 8. When training our network, we augment the training set by randomly cropping subimages; the augmented dataset contains 8000 image pairs.
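For illustration, the conversion from labeled rectangles to a ground-truth saliency map can be sketched as follows (the rectangle format is hypothetical):

```python
# Minimal sketch: convert labeled rectangles into a binary ground-truth saliency map.
import numpy as np

def rects_to_saliency_map(rects, height, width):
    """rects: list of (x1, y1, x2, y2) boxes marking differing regions."""
    M = np.zeros((height, width), dtype=np.uint8)
    for x1, y1, x2, y2 in rects:
        M[y1:y2, x1:x2] = 255        # pixels inside labeled rectangles
    return M
```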

4.2 Qualitative Results

We tested 1000 vehicle image pairs and calculated the verification accuracy, which reached \(88.7\%\) (as shown in Table 1). Figure 5 shows ten sets of qualitative results. As we can see, local differences such as decorations and inspection marks are detected precisely. Moreover, some interference factors, such as slight stains and the crew in the cockpit, are ignored, because these factors are labeled as background in the training data.

4.3 Evaluation of Saliency Map

In this section, we evaluate the predicted saliency maps against the ground-truth maps using two metrics. The first step is to binarize the predicted saliency map; we then calculate the mean absolute error (MAE) using the following formula:

$$\begin{aligned} MAE=\frac{1}{W*H}\sum _{i=1}^{W}\sum _{j=1}^{H}|\bar{M}(i,j)-\bar{S}(i,j)| \end{aligned}$$
(9)
Table 1. Comparison of two loss functions.

where \(\bar{S}\) is the binarized saliency map for \({X_{1}^{i}}\) and \({X_{2}^{i}}\), \(\bar{M}\) is the corresponding ground-truth map, and W and H denote the width and height of the map, respectively. We then calculate the pixel accuracy, a simple metric that measures the ratio of correctly predicted pixels to the total number of pixels. The formula for pixel accuracy is as follows:

$$\begin{aligned} PA=\frac{\sum _{i=1}^{K}P_{ii}}{\sum _{i=1}^{K}\sum _{j=1}^{K}P_{ij}} \end{aligned}$$
(10)

where \(P_{ii}\) is the number of pixels of class i predicted as class i (i.e., correctly predicted), \(P_{ij}\) is the number of pixels of class i predicted as class j, and K is the number of classes, so the denominator is the total number of pixels. The values are shown in Table 2.
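Both metrics are straightforward to compute; the sketch below uses an illustrative binarization threshold of 0.5.

```python
# MAE (Eq. (9)) and pixel accuracy (Eq. (10)) for one binarized prediction.
import numpy as np

def mae_and_pixel_accuracy(S, M, tau=0.5):
    """S: predicted saliency map in [0, 1]; M: binary ground-truth map in {0, 1}."""
    B = (S >= tau).astype(np.float64)      # binarized prediction
    mae = np.abs(B - M).mean()             # Eq. (9)
    pa = (B == M).mean()                   # Eq. (10): correct pixels / total pixels
    return mae, pa
```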

Table 2. Quantitative results of saliency map prediction.

4.4 Quantitative Comparison with Other Methods

For a comprehensive evaluation, we compare and analyze our approach against other methods. The corresponding experimental details and analysis are discussed below.

Object Detection. This baseline uses an object detection [9] method to capture local differences between two images. The main idea is to stack the two images as the input and to treat the differences between them as the objects to be detected. The details are as follows: first, the two RGB images are stacked into a 6-channel image [18] and fed into the network; then we follow the pipeline of a state-of-the-art object detection framework. In particular, we use Faster R-CNN [10] as the detection pipeline, i.e., “CNN feature extraction + region proposal + classification”, with the 6-channel image as the input.
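The 6-channel input of this baseline is simply the channel-wise stacking of the two aligned images (shapes below are illustrative):

```python
# Stack two aligned RGB images along the channel axis to form the 6-channel input.
import numpy as np

img1 = np.zeros((224, 224, 3), dtype=np.uint8)   # placeholder aligned image pair
img2 = np.zeros((224, 224, 3), dtype=np.uint8)
stacked = np.concatenate([img1, img2], axis=2)   # shape (224, 224, 6)
```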

Analysis. In this section, we analyze the object detection baseline and our approach. Both methods exploit local information between vehicle image pairs. In the object detection method, the feature of “difference” is learned via a CNN, and candidate regions of local differences are proposed by the Region Proposal Network in Faster R-CNN [10]. The advantage of this method is that the features of local differences do not fade away within the whole image.

Fig. 6. Top: the saliency map. Middle: the binary map. Bottom: the connected-regions map.

In our approach, we develop a multi-task deep convolutional encoder-decoder network. First, we use a convolutional neural network (CNN) to classify two images as identical or not. Then, we use a deconvolutional network to predict a saliency map that indicates the degree of their difference at every pixel. We consider not only the holistic features but also the differences in local information: we predict the label and, in addition, generate a saliency map to interpret the local differences. Besides, we analyze the correlation between the label prediction and the average saliency. Therefore, our approach is more comprehensive than the object detection method.

Fig. 7. The PR curves of the two methods.

To make the analysis more thorough, we calculated the precision and recall of the two methods and compared their average precision. For the object detection method, we calculate the intersection-over-union (IoU), i.e., the overlap ratio between the bounding box generated by the model and the ground truth; the best result is complete overlap, i.e., an IoU equal to 1. We then set a threshold \(\tau \) and count a bounding box as correctly detected when its IoU is greater than or equal to \(\tau \). In this way, we calculate the precision and recall.

However, since bounding box information can hardly be obtained from a saliency map, we cannot directly compute the IoU between a saliency map and the corresponding ground truth. We therefore adopt the following strategy to evaluate our approach. We first convert the predicted saliency map into a binary map and then extract connected regions from the binary map. We use an 8-connected region labelling algorithm, which marks each pixel and, in a single scan of the image, computes the number of connected regions and the number of pixels in each region. Finally, based on these pixel counts, we calculate the numbers of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) pixels in each connected region. Figure 6 shows the binary map and the connected regions. In this way, we calculate the precision and recall with the following formulas:

$$\begin{aligned} precision=\frac{TP}{TP+FP} \end{aligned}$$
(11)
$$\begin{aligned} recall=\frac{TP}{TP+FN} \end{aligned}$$
(12)

Note that when the saliency map is binarized, the threshold is varied from 0 to 255. For each threshold, we compute the precision and recall of every binary map and average them over all image pairs, which yields 256 pairs of precision and recall values. Using recall as the abscissa and precision as the ordinate, we plot the precision-recall (PR) curves of the two methods in Fig. 7. We also calculate the average precision (AP) of the two methods: the AP of our approach is 0.8068, while that of the object detection method is 0.8013. The two APs are fairly close, with our method having a small advantage.
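The evaluation procedure for our method is sketched below; SciPy's 8-connected labelling stands in for our connected-region algorithm, and precision and recall are accumulated from the per-region pixel counts.

```python
# PR-curve evaluation for the saliency maps: binarize at each threshold,
# extract 8-connected regions, and accumulate pixel-level TP/FP/FN counts.
import numpy as np
from scipy import ndimage

def precision_recall(pred_maps, gt_maps, tau):
    """pred_maps: saliency maps in [0, 255]; gt_maps: binary {0, 1} maps."""
    eight = np.ones((3, 3), dtype=int)                    # 8-connectivity structure
    precisions, recalls = [], []
    for S, M in zip(pred_maps, gt_maps):
        B = (S >= tau).astype(np.uint8)
        regions, n = ndimage.label(B, structure=eight)    # connected regions of the binary map
        tp = fp = 0
        for r in range(1, n + 1):
            mask = regions == r
            tp += int(np.sum(M[mask] == 1))               # salient pixels inside the region
            fp += int(np.sum(M[mask] == 0))
        fn = int(np.sum((B == 0) & (M == 1)))
        precisions.append(tp / max(tp + fp, 1))           # Eq. (11)
        recalls.append(tp / max(tp + fn, 1))              # Eq. (12)
    return float(np.mean(precisions)), float(np.mean(recalls))

# Sweeping tau over 0..255 yields the 256 (precision, recall) pairs of the PR curve.
```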

4.5 Limitation

In actual traffic scenes, illumination changes between viewpoints mean that vehicle images may be captured under different lighting conditions, which can cause specular reflections at some viewpoints. Figure 8 shows some inaccurate results affected by such highlights.

Fig. 8. Some inaccurate results. (a) Illumination changes between viewpoints cause specular reflection at a specific viewpoint, so the saliency map prediction is inaccurate, but the label prediction is correct. (b) Specular reflection causes both the saliency map prediction and the label prediction to be incorrect.

5 Conclusion

We present a novel method to discriminate similar vehicle images. Our method uses a convolutional neural network (CNN) to extract features from two vehicle images. We first verify the two whole vehicle images, using logistic regression to measure the holistic difference between their features. Then we concatenate the features of the two images and predict a saliency map with a deconvolutional network; the saliency map shows the fine-grained differences between the two vehicle images. Our network design enables end-to-end training. We validate our algorithm on a vehicle image dataset. Experimental results show that our approach is fast and effective while using very cheap annotation. For similar vehicle images, we can perform fast and efficient verification and, most importantly, provide evidence for the verification. To make the experiments more comprehensive, we added comparison methods and conducted comparative experiments, and the results show that our approach achieves better performance than the other methods. Furthermore, our framework can be extended to wider verification applications.