
1 Introduction

Unmanned aerial vehicles (UAVs) were first used primarily in military applications. More recently, UAVs have been used as remote sensing tools to provide aerial views in both scientific research and the civilian domain [1, 6]. UAVs can fly at lower altitudes and collect images at much higher resolutions than conventional platforms such as satellites and manned aircraft [16]. Furthermore, UAVs can be used in a wide variety of scenarios because of their high maneuverability, simple control, security, and reliability. This broad applicability is also due to the sensors carried by UAVs. Traditional UAVs equipped with visual sensors are widely used in surveying, mapping, and inspection, but both the type and quality of the data have been greatly enhanced by recent advances in sensor technology [7, 10]. In particular, infrared sensors have excellent imaging capabilities because they sense wavelengths beyond the visible spectrum, and the cost of thermal sensors has decreased dramatically [4]. As a result, UAVs equipped with both visible-light cameras and thermal infrared cameras have proved useful in many applications, such as power line inspection, solar panel inspection, search and rescue, precision agriculture, and fire fighting.

Vehicle detection is an important object detection application [2, 19], and detection from aerial platforms has become a key aspect of autonomous UAV systems in rescue or surveillance missions [3]. Autonomous UAV systems have gained popularity in civil and military applications due to their convenience and effectiveness. With the rapid growth in the number of vehicles on the road in recent years, traffic regulation is facing a huge challenge [15], and automatic location reporting for detected vehicles can alleviate the need for manual image analysis [14]. Vehicle detection methods are mainly divided into background modeling methods and methods based on appearance features, but the main difficulty is that the images are affected by illumination, viewing angle, and occlusion [5]. In view of these difficulties, researchers have attempted to use traditional machine learning methods, but the results, to date, have not matched expectations [3]. In this paper, to address these issues, we focus on a deep learning method based on multi-source data.

In 2013, the region-based convolutional neural network (R-CNN) algorithm took detection accuracy to a new height. Since then, a number of new deep learning methods have been proposed, including Fast R-CNN, Faster R-CNN, the single shot multi-box detector (SSD) [11], and You Only Look Once (YOLO). YOLO frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities [12, 13]. The improved YOLOv2 is state-of-the-art on standard detection tasks such as PASCAL VOC and COCO. Using a novel multi-scale training method, YOLOv2 outperforms methods such as Faster R-CNN with ResNet and SSD in mean average precision (mAP), while running significantly faster [13].

This paper makes the following main contributions: (1) we propose a novel multi-source data acquisition system based on the UAV platform; (2) we attempt to detect vehicles based on deep learning using multi-source data obtained by the UAV; and (3) we take the multi-source data into consideration to improve the accuracy of the detection result.

The rest of this paper is organized as follows. An overview of the approach to data acquisition and processing is presented in Sect. 2. Section 3 describes the experimental process and the experimental results. Finally, we conclude the paper and discuss possible future work in Sect. 4.

2 Approach

In the following, we present an overview of the proposed approach for vehicle detection using visible and thermal infrared images. The whole process consists of four main parts: (1) obtain multi-source data through the UAV system; (2) conduct image correction and registration through feature point extraction and homography matrix; (3) integrate the multi-source data by image fusion and band combination; and (4) train the data and detect the vehicles using the YOLO model. A flowchart of the proposed approach is shown in Fig. 1.

Fig. 1. Flowchart of the proposed approach.

2.1 Multi-source Data Acquisition System

The UAV platform consists of a flight controller, propulsion system, GPS, rechargeable battery, and cameras with high-definition image transmission [9]. The use of both visible and thermal infrared cameras is essential for multi-source data. However, very few aircraft systems support the two kinds of cameras at the same time, and there are problems with the synchronization and transmission of multi-source data. To address these problems, the UAV system for obtaining multi-source data is equipped with two additional cameras, a PC motherboard, a 4G module, a base station, and a lithium battery. The framework of the multi-source acquisition system is shown in Fig. 2.

Fig. 2. Multi-source data acquisition system flow diagram.

2.2 Image Correction and Registration

Modern camera lenses always suffer from some distortion. To alleviate its influence on image registration, distortion correction is essential. We adopt a simple checkerboard to correct the distortion of the visible-light camera, but a common checkerboard is ineffective for the thermal infrared camera, because its black and white grids appear almost identical in a thermal infrared image. Fortunately, we found a square area, shown in Fig. 3(a), paved with marble tiles of different colors. Marbles of different colors reach different temperatures under sunlight owing to their different reflectivity and absorptivity: higher-temperature marbles appear as lighter areas in the thermal infrared image, and lower-temperature marbles as darker areas. The same square in the thermal infrared image is shown in Fig. 3(b). We employ the maximally stable extremal regions (MSER) algorithm to locate the dark regions of the calibration area, filtering them with appropriate thresholds on region size and average pixel value, since the dark tiles appear as small regions in the image. We then compute the center of each region and treat these centers as the corner points of a checkerboard. The extracted points are shown in Fig. 3(c). After obtaining the checkerboard patterns for the two kinds of camera, we use the camera calibrator tool in MATLAB to correct the distortion.

Fig. 3. (a) The calibration area in the visible image. (b) The calibration area in the thermal infrared image. (c) Detecting the points in the calibration area in the thermal infrared image.
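To make the MSER-based point extraction concrete, the following Python sketch (using OpenCV, with illustrative file names and thresholds that are not taken from the paper) detects small dark regions in the thermal infrared image and returns their centroids as calibration points:

import cv2
import numpy as np

def thermal_calibration_points(thermal_path, max_area=400, max_mean=60):
    """Locate the dark marble tiles in a thermal image and return their centers.

    max_area and max_mean are illustrative thresholds; they would need to be
    tuned to the actual image resolution and temperature contrast.
    """
    gray = cv2.imread(thermal_path, cv2.IMREAD_GRAYSCALE)

    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray)

    centers = []
    for pts in regions:                                  # pts is an (N, 2) array of (x, y) pixels
        area = len(pts)
        mean_val = gray[pts[:, 1], pts[:, 0]].mean()
        if area <= max_area and mean_val <= max_mean:    # keep only small, dark (cool) regions
            centers.append(pts.mean(axis=0))             # centroid as (x, y)
    return np.array(centers, dtype=np.float32)

The returned centers play the role of checkerboard corners and could then be passed to a standard calibration routine; the paper itself uses the camera calibrator tool in MATLAB for this step.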

To process the visible and thermal infrared images together, the thermal infrared image should be registered to the visible image, because the visible image has the higher resolution. In this framework, the registration requires translation, rotation, and scaling. The two images show similar patterns, but the pixel intensities in the same region can be quite different or even inverted, which makes automatic feature matching complex and unreliable. Therefore, we select ground control points (GCPs) manually, and the proposed framework registers the multi-source data using a homography matrix [17].
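As a minimal sketch of this registration step (the GCP coordinates and file names below are placeholders, not values from the paper), the homography can be estimated from the manually selected point pairs and applied to the thermal image with OpenCV:

import cv2
import numpy as np

# Manually selected ground control points (illustrative coordinates only):
# pts_ir are pixels in the thermal infrared image, pts_vis the corresponding
# pixels in the visible image. More GCPs would be used in practice.
pts_ir = np.float32([[52, 40], [590, 38], [600, 430], [60, 440]])
pts_vis = np.float32([[48, 35], [598, 30], [610, 438], [55, 452]])

# Estimate the homography matrix from the point correspondences.
H, _ = cv2.findHomography(pts_ir, pts_vis)

# Warp the thermal image into the coordinate frame of the visible image.
ir = cv2.imread("thermal.png", cv2.IMREAD_GRAYSCALE)
vis = cv2.imread("visible.png")
ir_registered = cv2.warpPerspective(ir, H, (vis.shape[1], vis.shape[0]))

Because the two cameras are rigidly mounted on the UAV, the same matrix H can be reused for every frame, which is how the registration is applied in Sect. 3.2.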

2.3 Vehicle Detection

In the proposed detection framework, we detect vehicles using the multi-source data, so the key question of this paper is how to use the multi-source data to increase the detection accuracy compared with single-source data. We propose two ways of combining the visible and thermal infrared data for a better detection result: one is image fusion, based on weighted image fusion and image band combination; the other is decision fusion, based on the detection results in the visible and thermal infrared images. The vehicle detection method used in this framework is YOLO.

YOLO. YOLO runs a single convolutional network to predict multiple bounding boxes and class probabilities for those boxes. The network of YOLO uses features from the entire image to predict each bounding box, and predicts all the bounding boxes across all the classes. The system models the detection as the following process [12]:

  1.

    Divide the input image into an \(S \times S\) grid.

  2.

    Predict B bounding boxes and confidence scores for those boxes in each grid cell. The confidence is defined as \(Pr(Object)*IOU_{\mathrm {pred}}^{\mathrm {truth}}\) (intersection over union), and each bounding box consists of five predictions: x, y, w, h, and confidence, where (x, y) is the center of the box and (w, h) are its width and height.

  3.

    Predict C conditional class probabilities in each grid cell, \(Pr(Class_i |Object)\), which are conditioned on the grid cell containing an object.

  4.

    At test time, multiply the conditional class probabilities and the individual box confidences, which gives the class-specific confidence scores for each box [8, 12].

Through the above process, YOLO learns generalizable representations of objects from the entire image, and is extremely fast. We therefore use YOLO as the vehicle detection method in the proposed framework. In addition, we enhance YOLO by expanding the types of input image, making it possible to support not only natural images but also multi-band images.
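The following minimal sketch illustrates step 4 of the process above, assuming the network outputs have already been split into per-box confidences and per-cell class probabilities (the tensor layout is a simplification, not the exact layout of the original YOLO implementation):

import numpy as np

def class_specific_scores(box_conf, class_probs):
    """box_conf:    (S, S, B) array of Pr(Object) * IOU per box.
    class_probs: (S, S, C) array of Pr(Class_i | Object) per grid cell.
    Returns an (S, S, B, C) array of class-specific confidence scores."""
    return box_conf[..., :, None] * class_probs[..., None, :]

# Toy example: S = 7 grid, B = 2 boxes per cell, C = 1 class (vehicle).
S, B, C = 7, 2, 1
scores = class_specific_scores(np.random.rand(S, S, B), np.random.rand(S, S, C))
keep = scores > 0.5   # same confidence threshold as used in Sect. 3.4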

Image Fusion

Weighted fusion. In order to add the information of the thermal infrared image to the visible image, we adopt simple weighted fusion, known as the weighted averaging (WA) method. This is the simplest and most direct image fusion method. In the thermal infrared image, vehicles and the background differ greatly in pixel value, so fusing the visible image with the thermal infrared pixel values can suppress the complex background to a certain extent. The weighted fusion we apply is therefore straightforward and efficient, and its implementation is given by the following formula:

$$\begin{aligned} B_i=w*V_i+(1-w)*I,i=1,2,3 \end{aligned}$$
(1)

where B, V, and I represent the new three-band image, the visible image, and the thermal infrared image, respectively, w represents the weight, and i is the band index.
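A direct implementation of formula (1) with NumPy/OpenCV could look as follows (a sketch only; the inputs are assumed to be already registered):

import cv2
import numpy as np

def weighted_fusion(visible_bgr, thermal_gray, w=0.8):
    """Apply formula (1): blend each visible band with the single thermal band."""
    h, w_px = visible_bgr.shape[:2]
    thermal = cv2.resize(thermal_gray, (w_px, h))        # match the visible image size
    fused = w * visible_bgr.astype(np.float32) \
        + (1.0 - w) * thermal[..., None].astype(np.float32)
    return np.clip(fused, 0, 255).astype(np.uint8)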

Band combination. We combine the image bands in order to learn more image features in the deep learning framework. The visible image has three bands, and the thermal infrared image has one band. By combining the bands of the two images, the new image has four bands, which provide richer image information.
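Band combination then reduces to stacking the registered thermal band as a fourth channel; a brief sketch under the same assumptions is:

import numpy as np

def band_combination(visible_bgr, thermal_gray):
    """Stack the registered thermal band behind the three visible bands -> (H, W, 4)."""
    assert visible_bgr.shape[:2] == thermal_gray.shape[:2], "images must be registered"
    return np.dstack([visible_bgr, thermal_gray])

Feeding such four-band arrays to the detector requires the input layer of YOLO to accept four channels, which is the modification mentioned at the end of the YOLO subsection.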

Decision Fusion. The detection results for a specific vehicle in the multi-source data are usually different; in other words, the detection boxes in the visible and thermal infrared images differ, and the redundant or complementary results provided by the multi-source data can be aggregated. Therefore, we adopt decision fusion to combine the detection results based on two boxes: \(box_1=(x_1,y_1,w_1,h_1,p_1)^T\) and \(box_2=(x_2,y_2,w_2,h_2,p_2)^T\), where (x, y) is the center coordinate of the box, (w, h) are its width and height, and p is the confidence score of the box. The new box \(box_{new}=(x_{new},y_{new},w_{new},h_{new})^T\) is calculated as in formula (2). An example of detection box fusion is shown in Fig. 4. In addition, we need to take the overlapping area of the two detection boxes into consideration, so we employ a threshold to judge whether or not to combine them.

$$\begin{aligned} {\left[ \begin{array}{c} x_{new}\\ y_{new}\\ w_{new}\\ h_{new} \end{array} \right] } = {\frac{p_1}{p_1+p_2}}* {\left[ \begin{array}{c} x_{1}\\ y_{1}\\ w_{1}\\ h_{1} \end{array} \right] } + {\frac{p_2}{p_1+p_2}}* {\left[ \begin{array}{c} x_{2}\\ y_{2}\\ w_{2}\\ h_{2} \end{array} \right] } \end{aligned}$$
(2)
Fig. 4. Detection box fusion.
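A possible implementation of formula (2), together with the overlap test mentioned above, is sketched below; the overlap measure and its threshold value of 0.5 are assumptions, and the paper does not specify a confidence score for the fused box, so none is returned here:

def iou(b1, b2):
    """Intersection over union of two boxes given as (x, y, w, h) with center coordinates."""
    x11, y11 = b1[0] - b1[2] / 2, b1[1] - b1[3] / 2
    x12, y12 = b1[0] + b1[2] / 2, b1[1] + b1[3] / 2
    x21, y21 = b2[0] - b2[2] / 2, b2[1] - b2[3] / 2
    x22, y22 = b2[0] + b2[2] / 2, b2[1] + b2[3] / 2
    iw = max(0.0, min(x12, x22) - max(x11, x21))
    ih = max(0.0, min(y12, y22) - max(y11, y21))
    inter = iw * ih
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union > 0 else 0.0

def fuse_boxes(box1, box2, overlap_thresh=0.5):
    """box = (x, y, w, h, p). If the two detections overlap enough, fuse them
    by confidence-weighted averaging as in formula (2); otherwise keep both."""
    if iou(box1, box2) < overlap_thresh:
        return [box1, box2]                      # treated as two different vehicles
    p1, p2 = box1[4], box2[4]
    fused = tuple((p1 * a + p2 * b) / (p1 + p2) for a, b in zip(box1[:4], box2[:4]))
    return [fused]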

3 Experiments and Analysis

We obtained the multi-source data sets from the UAV acquisition system. The data sets were then preprocessed and divided into training sets and test sets. We then trained models in the deep learning framework. Finally, we used the trained models to detect the vehicles in the test sets and compared the vehicle detection results across the different kinds of data sets.

3.1 Data Collection

We used a DJI MATRICE 100 quadcopter as the UAV, for its user-friendly control system and flexible platform. We equipped the UAV with two cameras for the multi-source data. One camera was a USB-connected industrial digital camera with a pixel size of 5.0 \(\upmu \)m\(\,\times \,\)5.2 \(\upmu \)m. The other was a card-type infrared camera with the wavelength range 8–14 \(\upmu \)m. The developed image preservation system saved the multi-source data simultaneously in video mode.

After obtaining the video from the UAV system, we selected some frames at intervals to ensure the diversity in the data set. We then obtained the visible and thermal infrared images from the video frames by cropping (the size of the video frame was 1280\(\,\times \,\)960, and the size of each image was 640\(\,\times \,\)480).

3.2 Data Pre-processing

The cameras carried on the UAV system suffered from distortion. We therefore used a large checkerboard to correct the visible-light camera, and the marble square described in Sect. 2.2 to correct the distortion of the thermal infrared camera. After selecting the GCPs manually, we used the homography matrix to complete the registration, which gave us the transformation relationship for every pixel between the two images. Because the cameras were fixed and the imaging geometry did not change, we applied the same transformation to all the images.

The size of all the images was 640\(\,\times \,\)480. After the registration, the thermal infrared images contained invalid areas, so we cropped both images based on the valid region of the registered infrared images, giving cropped images of size 500\(\,\times \,\)280. A set of images is shown as an example in Fig. 5, where it can be seen that the registration result basically meets the processing requirements.

Fig. 5. (a) Visible image. (b) Registered thermal infrared image.

Based on the visible image data set and the thermal infrared image data set, we undertook the weighted image fusion with different weight values: \(w_1\) = 0.7, \(w_2\) = 0.8, and \(w_3\) = 0.9. We then prepared the band combination data, which contained the three bands of the visible image and the one band of the infrared image. We thus obtained six data sets in total.

The next step was to select the training data and the test data. We labeled the vehicles in the images, with each label containing five parameters (class, x, y, w, h): the first parameter is the class, (x, y) is the top-left corner of the vehicle's bounding box, and (w, h) are its width and height. In the experiment, we labeled 1000 vehicles for the training data and 673 vehicles for the test data in every data set. Some typical samples from the visible data set and the thermal infrared data set are shown in Fig. 6.

Fig. 6. (a) Tagged samples from the visible image. (b) Tagged samples from the thermal infrared image.
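Because the annotations store the top-left corner while darknet-style YOLO training files expect normalized center coordinates, a small conversion step is needed; a sketch, assuming the 500 x 280 cropped image size from Sect. 3.2, is:

def to_yolo_label(cls, x, y, w, h, img_w=500, img_h=280):
    """Convert a (class, top-left x, y, width, height) annotation in pixels
    into YOLO's normalized (class, x_center, y_center, width, height) line."""
    xc = (x + w / 2) / img_w
    yc = (y + h / 2) / img_h
    return f"{cls} {xc:.6f} {yc:.6f} {w / img_w:.6f} {h / img_h:.6f}"

# Example: a vehicle (class 0) at top-left (120, 80) with a 40 x 20 pixel box.
print(to_yolo_label(0, 120, 80, 40, 20))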

3.3 Model Training for Vehicle Detection

We used YOLO to train on the six data sets, with a batch size of 15, a momentum of 0.9, a weight decay of 0.0005, a learning rate of 0.00005, and the input images resized to 448\(\,\times \,\)448. We trained until convergence on every data set and saved the training results for each one.

3.4 Vehicle Detection

Based on the training results, we used the trained models to detect the vehicles in the six test data sets, with each data set using its own model. We applied the same confidence threshold (0.5) to each data set, which gave us the vehicle detections with their confidence scores for each box.

To visualize the detection performance, we take two test images from the six data sets as an example. The first image contains 6 vehicles in the ground truth, shown in Fig. 7(a), and the second image contains 20 vehicles in the ground truth, shown in Fig. 8(a). The vehicle detection results are shown in Figs. 7(b)–(f) and 8(b)–(f); the detection result for the decision fusion image is displayed using three of the four bands.

Fig. 7. (a) Ground truth. (b) Visible image detection. (c) Thermal infrared image detection. (d) Weighted fusion detection (w = 0.8). (e) Band combination detection. (f) Decision fusion detection.

Fig. 8. (a) Ground truth. (b) Visible image detection. (c) Thermal infrared image detection. (d) Weighted fusion detection (w = 0.8). (e) Band combination detection. (f) Decision fusion detection.

As shown in Figs. 7 and 8, the vehicles labelled 3 and 4 in Fig. 7(a) are not detected in the visible image but are detected in the thermal infrared image (Fig. 7(c)). Likewise, the vehicles labelled 1–7 in Fig. 8(a) are not detected in the visible image, but some of them are detected with the other strategies; for example, the vehicles labelled 2, 3, and 6 are detected in the weighted fusion image.

3.5 Comparative Experiment

To evaluate the vehicle detection performance, four commonly used criteria were computed: the false positive rate (FPR), the missing ratio (MR), accuracy (AC), and error ratio (ER). These criteria are defined as follows:

$$\begin{aligned} FPR={\frac{Number\ of\ falsely\ detected\ vehicles}{Number\ of\ detected\ vehicles}}\times 100\% \end{aligned}$$
(3)
$$\begin{aligned} MR={\frac{Number\ of\ missing\ vehicles}{Number\ of\ vehicles}}\times 100\% \end{aligned}$$
(4)
$$\begin{aligned} AC={\frac{Number\ of\ detected\ vehicles}{Number\ of\ vehicles}}\times 100\% \end{aligned}$$
(5)
$$\begin{aligned} ER=FPR+MR \end{aligned}$$
(6)

For every detected vehicle, if it had an intersection overlap of greater than 0.5 with a test vehicle, we considered it a true detection. If no detected vehicle had an intersection overlap of greater than 0.5 with a test vehicle, that test vehicle was considered a missing vehicle. Table 1 shows the detection performance on the different data sets [18].

Table 1. Comparison of the different data sets (test vehicle sample sets: 673)
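Given detections and ground-truth boxes matched by the overlap rule described above, the four criteria could be computed as in the following sketch; it reuses the iou helper from the decision fusion sketch, assumes non-empty box lists, and interprets the numerator of formula (5) as correctly detected vehicles:

def evaluate(detections, ground_truth, overlap_thresh=0.5):
    """Compute FPR, MR, AC and ER (formulas (3)-(6)) in percent.
    detections and ground_truth are non-empty lists of (x, y, w, h) boxes."""
    matched_gt = set()
    true_pos = 0
    for det in detections:
        for j, gt in enumerate(ground_truth):
            if j not in matched_gt and iou(det, gt) > overlap_thresh:
                matched_gt.add(j)
                true_pos += 1
                break
    fpr = (len(detections) - true_pos) / len(detections) * 100
    mr = (len(ground_truth) - len(matched_gt)) / len(ground_truth) * 100
    ac = true_pos / len(ground_truth) * 100
    return fpr, mr, ac, fpr + mr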

Based on the experimental results, we can conclude that the proposed detection framework achieves good vehicle detection performance. Detection is less effective on the thermal infrared data set due to its lower resolution. The weighted fusion and band combination data sets improve the accuracy of vehicle detection to a certain degree, and decision fusion of the results from the two base image data sets also increases the detection accuracy. Among these data sets, weighted fusion with a weight of 0.8 gave the largest improvement in vehicle detection accuracy.

4 Conclusions

In this study, our main aim was to design a small UAV system with two types of cameras to obtain multi-source data. We then used a deep learning framework to detect vehicles in the corrected and registered visible and thermal infrared images, and the results show that the framework is an efficient way to detect vehicles. The proposed framework adopts weighted fusion, band combination, and decision fusion methods to exploit the multi-source data. The experimental results show that the addition of the thermal infrared data can improve the accuracy of vehicle detection.

In the future, we will try to collect data under extreme weather conditions and find more efficient methods to increase the accuracy of the image registration. Moreover, we will explore further possibilities of using multi-source data to improve the accuracy of vehicle detection.