Abstract
Detecting vehicles from autonomous unmanned aerial vehicle (UAV) systems is attracting the attention of more and more researchers. This technique has also been widely applied in traffic monitoring and management. Differing from the other object detection frameworks which just use data from a single source (usually visible images), we adopt multi-source data (visible and thermal infrared images) for a robust detection performance. Since deep learning techniques have shown great performance in object detection, we utilize “You only look once”(YOLO), which is a state-of-the-art real-time object detection framework for automatic vehicle detection. The main contributions of this paper are as follows. (1) Through integrating a thermal infrared imaging sensor and a visible-light imaging sensor on the UAV, we build a multi-source data acquisition system. (2) The rich information from the multi-source data is fully exploited in the proposed detection framework to further improve the accuracy of the detection result.
1 Introduction
Unmanned aerial vehicles (UAVs) were first used primarily in military applications. More recently, UAVs have been used as remote sensing tools to provide aerial views in both scientific research and the civilian domain [1, 6]. UAVs can fly at lower altitudes and collect images at much higher resolutions than conventional platforms such as satellites and manned aircraft [16]. Furthermore, UAVs can be used in a variety of scenarios because of their high maneuverability, simple control, security, and reliability. This wide applicability also owes much to the sensors carried by the UAVs. Traditional UAVs equipped with visual sensors are widely used in surveying, mapping, and inspection, and recent advances in sensor technology have greatly enhanced both the type and quality of the data they collect [7, 10]. In particular, infrared sensors offer excellent imaging capabilities because they operate at wavelengths beyond the visible spectrum, and the cost of thermal sensors has decreased dramatically [4]. As a result, UAVs equipped with both visible-light cameras and thermal infrared cameras have proved useful in many applications, such as power line inspection, solar panel inspection, search and rescue, precision agriculture, and fire fighting.
Vehicle detection is an important object detection application [2, 19], and detection from aerial platforms has become a key aspect of autonomous UAV systems in rescue and surveillance missions [3]. Autonomous UAV systems have gained popularity in civil and military applications due to their convenience and effectiveness. With the rapid growth in the number of vehicles in recent years, traffic regulation faces a huge challenge [15], and automatic location reporting for detected vehicles can alleviate the need for manual image analysis [14]. Vehicle detection methods can be broadly divided into background modeling and methods based on appearance features, but the main difficulty is that the images are influenced by illumination, viewing angle, and occlusion [5]. In view of these difficulties, researchers have attempted to use traditional machine learning methods, but the results, to date, have not matched researchers' expectations [3]. In this paper, to address these issues, we focus on a deep learning method based on multi-source data.
In 2013, the region-based convolutional neural network (R-CNN) algorithm took detection accuracy to a new height. Since then, a number of new deep learning methods have been proposed, including Fast R-CNN, Faster R-CNN, the single shot multibox detector (SSD) [11], and You Only Look Once (YOLO). YOLO frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities [12, 13]. The improved YOLOv2 is state-of-the-art on standard detection tasks such as PASCAL VOC and COCO. Using a novel multi-scale training method, YOLOv2 outperforms methods such as Faster R-CNN with ResNet and SSD in mean average precision (mAP), while running significantly faster [13].
This paper makes the following main contributions: (1) we propose a novel multi-source data acquisition system based on the UAV platform; (2) we attempt to detect vehicles based on deep learning using multi-source data obtained by the UAV; and (3) we take the multi-source data into consideration to improve the accuracy of the detection result.
The rest of this paper is organized as follows. An overview of the approach to data acquisition and processing is presented in Sect. 2. Section 3 describes the experimental process and the experimental results. Finally, we conclude the paper and discuss possible future work in Sect. 4.
2 Approach
In the following, we present an overview of the proposed approach for vehicle detection using visible and thermal infrared images. The whole process consists of four main parts: (1) obtain multi-source data through the UAV system; (2) conduct image correction and registration through feature point extraction and homography matrix; (3) integrate the multi-source data by image fusion and band combination; and (4) train the data and detect the vehicles using the YOLO model. A flowchart of the proposed approach is shown in Fig. 1.
2.1 Multi-source Data Acquisition System
The UAV platform consists of a flight controller, propulsion system, GPS, rechargeable battery, and cameras with high-definition image transmission [9]. Both visible and thermal infrared cameras are essential for multi-source data. However, very few aircraft systems support the two kinds of cameras at the same time, and there are problems with the synchronization and transmission of multi-source data. To address these problems, the UAV system for obtaining multi-source data needs two extra cameras, a PC motherboard, a 4G module, a base station, and a lithium battery. The framework of the multi-source acquisition system is shown in Fig. 2.
2.2 Image Correction and Registration
Modern camera lenses always suffer from distortion. To alleviate its influence on the image registration, distortion correction is essential. We adopted a simple checkerboard to correct the distortion of the visible-light camera, but a common checkerboard is ineffective for the thermal infrared camera, because its black and white grids appear almost identical in a thermal infrared image. Fortunately, we found a square area, shown in Fig. 3(a), which is paved with marbles of different colors. Under sunlight, the marbles reach different temperatures owing to their different reflectivity and absorptivity: higher-temperature marbles appear as lighter areas in the thermal infrared image, and lower-temperature marbles as darker ones. The square in the thermal infrared image is shown in Fig. 3(b). We employed the maximally stable extremal regions (MSER) algorithm to locate the black regions of the calibration area, using appropriate thresholds on the region size and the average region pixel value, since the black points in the image are small. We then computed the centers of these regions and treated them as the corner points of a checkerboard; the extracted points are shown in Fig. 3(c). After obtaining the checkerboards for the two kinds of camera, we used the camera calibrator tool in MATLAB to correct the distortion.
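The region-center extraction described above can be sketched in a few lines. Since the exact thresholds and MSER settings are not given in the text, the code below uses a simplified stand-in (intensity threshold plus connected-component labeling) with hypothetical parameter values, rather than a true MSER implementation:

```python
import numpy as np
from collections import deque

def dark_region_centers(img, intensity_thresh=60, min_area=4, max_area=400):
    """Find centroids of small dark blobs: threshold the image, label
    4-connected components, and keep those whose pixel count lies in
    [min_area, max_area]. Returns a list of (row, col) centroids.
    A simplified stand-in for the MSER step; thresholds are illustrative."""
    mask = img < intensity_thresh
    visited = np.zeros_like(mask, dtype=bool)
    centers = []
    h, w = mask.shape
    for r in range(h):
        for c in range(w):
            if mask[r, c] and not visited[r, c]:
                q = deque([(r, c)])          # BFS over the component
                visited[r, c] = True
                pix = []
                while q:
                    y, x = q.popleft()
                    pix.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] \
                                and not visited[ny, nx]:
                            visited[ny, nx] = True
                            q.append((ny, nx))
                if min_area <= len(pix) <= max_area:
                    ys, xs = zip(*pix)
                    centers.append((sum(ys) / len(ys), sum(xs) / len(xs)))
    return centers
```

The centroids returned here play the role of the checkerboard corner points fed to the calibration tool.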
To process the visible and thermal infrared images together, the thermal infrared image should be registered to the visible image, because the visible image has the higher resolution. In this framework, the registration requires translation, rotation, and scaling. The two images have similar patterns, but the pixel intensities in the same region can be quite different or even inverted, which makes automatic feature matching complex and unreliable. We therefore selected the ground control points (GCPs) manually, and the proposed framework aligns the multi-source data with a homography matrix [17].
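The homography step from manually selected GCPs can be sketched with a generic direct linear transform (DLT); this is a textbook estimate from point pairs, not the authors' exact implementation:

```python
import numpy as np

def estimate_homography(src_pts, dst_pts):
    """Direct linear transform: estimate the 3x3 homography H mapping
    src_pts to dst_pts (at least 4 non-degenerate pairs) via SVD."""
    A = []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The null vector of A (smallest singular value) is H up to scale
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pt):
    """Map a 2-D point through H in homogeneous coordinates."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return (x / w, y / w)
```

Once H is estimated from the GCPs, applying it to every pixel gives the fixed transformation relationship reused later for all image pairs.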
2.3 Vehicle Detection
In the proposed detection framework, we detect the vehicles with the multi-source data, so the key point of this paper is how to use the multi-source data to increase the accuracy of detection compared with the result of single-source data. We propose two ways of combining the visible and thermal infrared data for a better detection result: one is image fusion based on image weighted fusion and image band combination; the other is decision fusion based on the result of the detection in the visible and thermal infrared images. The vehicle detection method used in this framework is YOLO.
YOLO. YOLO runs a single convolutional network to predict multiple bounding boxes and class probabilities for those boxes. The network of YOLO uses features from the entire image to predict each bounding box, and predicts all the bounding boxes across all the classes. The system models the detection as the following process [12]:
1. Divide the input image into an \(S \times S\) grid.
2. Predict B bounding boxes and confidence scores for those boxes in each grid cell. The confidence is defined as \(Pr(Object)*IOU_{\mathrm {pred}}^{\mathrm {truth}}\) (intersection over union), and each bounding box consists of five predictions: x, y, w, h, and confidence, where (x, y) is the center of the box and (w, h) are its width and height.
3. Predict C conditional class probabilities, \(Pr(Class_i |Object)\), in each grid cell, conditioned on the grid cell containing an object.
4. At test time, multiply the conditional class probabilities by the box confidences, which gives class-specific confidence scores for each box [8, 12].
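The class-specific score computation in the last step above can be illustrated with toy numbers (the probabilities below are made up for illustration):

```python
import numpy as np

# One grid cell with B = 2 predicted boxes and C = 3 classes.
box_conf = np.array([0.8, 0.3])           # Pr(Object) * IOU, one per box
class_prob = np.array([0.6, 0.3, 0.1])    # Pr(Class_i | Object), shared per cell

# Class-specific confidence = Pr(Class_i | Object) * Pr(Object) * IOU,
# giving one score per (box, class) pair.
class_scores = box_conf[:, None] * class_prob[None, :]
```

Thresholding these scores (the paper uses 0.5 at test time) selects the final detections for each class.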
Through the above process, YOLO learns generalizable representations of objects from the entire image, and is extremely fast. We therefore use YOLO as the vehicle detection method in the proposed framework. In addition, we enhance YOLO by expanding the types of input image, making it possible to support not only natural images but also multi-band images.
Image Fusion
Weighted fusion. In order to add the information of the thermal infrared image to that of the visible image, we adopt simple weighted fusion, known as the weighted averaging (WA) method, which is the simplest and most direct image fusion method. In the thermal infrared image, the vehicles and the background differ greatly in pixel value, so fusing the visible image with the thermal infrared pixel values can suppress the complex background to a certain extent. The weighted fusion is therefore straightforward and efficient, and its implementation is simple, as shown in the following formula:
\(B_i = w \cdot V_i + (1 - w) \cdot I, \quad i = 1, 2, 3\)    (1)

where B, V, and I represent the new three-band image, the visible image, and the thermal infrared image, respectively, w represents the weight, and i represents the band number.
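A minimal sketch of this step, assuming the weighted average takes the per-band form \(B_i = w V_i + (1-w) I\) with the weight values used later in the experiments:

```python
import numpy as np

def weighted_fusion(visible, infrared, w=0.8):
    """Per-band weighted average of a 3-band visible image (H, W, 3)
    and a single-band thermal image (H, W). Assumes the complement
    weight (1 - w) on the thermal band."""
    infrared3 = infrared[:, :, None].astype(float)   # broadcast IR to each band
    fused = w * visible.astype(float) + (1.0 - w) * infrared3
    return np.clip(fused, 0, 255).astype(np.uint8)
```

With w = 0.8 the visible image dominates while the thermal contrast between vehicles and background is blended in.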
Band combination. We combine the image bands in order to learn more image features in the deep learning framework. The visible image has three bands, and the thermal infrared image has one band. By combining the bands of the two images, the new image has four bands, which provide richer image information.
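The band combination described above is a simple channel stack; a sketch:

```python
import numpy as np

def band_combination(visible, infrared):
    """Stack the three visible bands (H, W, 3) and the one thermal band
    (H, W) into a four-band image (H, W, 4)."""
    return np.dstack([visible, infrared])
```

The resulting four-band image is what the modified YOLO input layer, extended to multi-band images as noted in Sect. 2.3, consumes.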
Decision Fusion. The detection results for a specific vehicle in the multi-source data are usually different; in other words, the detection boxes in the visible and thermal infrared images differ, and the redundant or complementary results provided by the multi-source data can be aggregated. We therefore adopt decision fusion to combine the detection results from two boxes, \(box_1=(x_1,y_1,w_1,h_1,p_1)^T\) and \(box_2=(x_2,y_2,w_2,h_2,p_2)^T\), where (x, y) is the center coordinate of a box, (w, h) are its width and height, and p is its confidence score. The new box \(box_{new}=(x_{new},y_{new},w_{new},h_{new})^T\) is calculated as in formula (2). An example of detection box fusion is shown in Fig. 4. In addition, we need to take the overlap of the detection boxes into consideration, so we employ a threshold to judge whether or not to combine two boxes.
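Since formula (2) is not reproduced here, the sketch below assumes one common fusion rule, a confidence-weighted average of the box parameters, combined with the overlap threshold mentioned above; it illustrates the idea rather than the paper's exact formula:

```python
def iou(b1, b2):
    """Intersection over union of two (x, y, w, h, ...) boxes,
    with (x, y) the box centre."""
    def corners(b):
        x, y, w, h = b[:4]
        return x - w / 2, y - h / 2, x + w / 2, y + h / 2
    ax1, ay1, ax2, ay2 = corners(b1)
    bx1, by1, bx2, by2 = corners(b2)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union > 0 else 0.0

def fuse_boxes(box1, box2, iou_thresh=0.5):
    """Decision-fusion sketch: if the two detections overlap enough,
    merge them into one box by a confidence-weighted average (an
    assumed stand-in for formula (2)); otherwise keep both."""
    if iou(box1, box2) < iou_thresh:
        return [box1, box2]
    p1, p2 = box1[4], box2[4]
    fused = tuple((p1 * a + p2 * b) / (p1 + p2)
                  for a, b in zip(box1[:4], box2[:4]))
    return [fused]
```

Weighting by confidence lets the more reliable source (often the visible image at this resolution) dominate the merged box.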
3 Experiments and Analysis
We obtained the multi-source data sets from the UAV acquisition system. The data sets were then preprocessed and divided into training data sets and test data sets. We then trained the data in the deep learning framework. Finally, we used the trained model to detect the vehicles in the test data sets, and we compared the vehicle detection results in the different kinds of data sets.
3.1 Data Collection
We used a DJI MATRICE 100 quadcopter as the UAV, for its user-friendly control system and flexible platform. We equipped the UAV with two cameras for the multi-source data. One camera was a USB-connected industrial digital camera with a pixel size of 5.0 \(\upmu \)m\(\,\times \,\)5.2 \(\upmu \)m. The other was a card-type infrared camera with the wavelength range 8–14 \(\upmu \)m. The developed image preservation system saved the multi-source data simultaneously in video mode.
After obtaining the video from the UAV system, we selected some frames at intervals to ensure the diversity in the data set. We then obtained the visible and thermal infrared images from the video frames by cropping (the size of the video frame was 1280\(\,\times \,\)960, and the size of each image was 640\(\,\times \,\)480).
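The frame cropping can be sketched as follows. The layout of the sub-images inside the 1280\(\,\times \,\)960 composite frame is not specified in the text, so the 2\(\,\times \,\)2 quadrant split below is an assumption for illustration:

```python
import numpy as np

def split_frame(frame):
    """Crop a composite 960x1280 video frame into four 480x640
    sub-images (assumed quadrant layout; the visible and thermal
    images occupy some of these quadrants)."""
    h, w = frame.shape[:2]
    hh, hw = h // 2, w // 2
    return [frame[:hh, :hw], frame[:hh, hw:],
            frame[hh:, :hw], frame[hh:, hw:]]
```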
3.2 Data Pre-processing
The cameras carried on the UAV system suffered from distortion. We therefore used a large checkerboard to correct the visible-light camera, and the marble square described in Sect. 2.2 to correct the distortion of the thermal infrared camera. By selecting the GCPs manually, we used the homography matrix to complete the registration, which gave us the transformation relationship for every pixel between the two images. Since the cameras were fixed and the imaging model invariant, we applied this transformation relationship to all the images.
The size of all the images was 640\(\,\times \,\)480. After the registration, the thermal infrared images had invalid areas, so we cropped the images based on the infrared images. The size of the cropped images was then 500\(\,\times \,\)280. A set of images is shown as an example in Fig. 5, where it can be seen that the registration result basically meets the processing requirements.
Based on the visible image data set and the thermal infrared image data set, we undertook the weighted image fusion with different weight values: \(w_1\) = 0.7; \(w_2\) = 0.8; \(w_3\) = 0.9. We then prepared the band combination data, which contained three bands of the visible image and one band of the infrared image. We thus obtained the six data sets.
The next step was to select the training data and the test data. We labeled the vehicles in the images. The labels contained five parameters (class, x, y, w, h). The first parameter expressed the class, (x, y) was the top-left corner of the vehicle, and (w, h) was the width and height of the vehicle. In the experiment, we labeled 1000 vehicles for the training data and 673 vehicles for the test data in every data set. Some typical samples from the visible data set and the thermal infrared data set are shown in Fig. 6.
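Converting these labels for training can be sketched as below, assuming the darknet-style YOLO convention of normalized center coordinates (the labels described above store the top-left corner in pixels):

```python
def to_yolo_label(cls, x, y, w, h, img_w, img_h):
    """Convert a (class, top-left x, y, width, height) pixel label into
    a normalized centre-based (class, cx, cy, w, h) tuple, the format
    darknet-style YOLO training assumes."""
    cx = (x + w / 2) / img_w
    cy = (y + h / 2) / img_h
    return (cls, cx, cy, w / img_w, h / img_h)
```

The 500\(\,\times \,\)280 cropped image size from Sect. 3.2 would be used as (img_w, img_h) here.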
3.3 Model Training for Vehicle Detection
We used YOLO to train the six data sets, with a batch size of 15, a momentum of 0.9, a weight decay of 0.0005, a learning rate of 0.00005, and the input images resized to 448\(\,\times \,\)448. For the six data sets, we tried to make sure that every data set was convergent and saved the training results of every data set.
3.4 Vehicle Detection
Based on the training results, we used the trained models to detect the vehicles in the six test data sets, with each data set using its own trained model. We applied the same confidence threshold (0.5) to each data set, which gave us the vehicle confidence scores for each box.
To visualize the detection performance, we take two test images from the six data sets as an example. The first image contains 6 vehicles in the ground truth, shown in Fig. 7(a), and the second contains 20 vehicles in the ground truth, shown in Fig. 8(a). The vehicle detection results are shown in Figs. 7(b)–(f) and 8(b)–(f), and the detection result of the decision fusion image is displayed using three of the four bands.
As shown in Figs. 7 and 8, the vehicles labeled 3 and 4 in Fig. 7(a) are not detected in the visible image, but they are detected in Fig. 7(c). Likewise, the vehicles labeled 1–7 in Fig. 8(a) are not detected in the visible image, but some of them are detected with the other strategies; for example, vehicles 2, 3, and 6 are detected in the weighted fusion image.
3.5 Comparative Experiment
To evaluate the vehicle detection performance, four commonly used criteria were computed: the false positive rate (FPR), the missing ratio (MR), accuracy (AC), and error ratio (ER). These criteria are defined as follows:
For every detected vehicle, if its intersection overlap with a test vehicle was greater than 0.5, we considered it a true detection. If no detected vehicle overlapped a test vehicle with an intersection overlap greater than 0.5, we considered that test vehicle to be missing. Table 1 shows the detection performance on the different data sets [18].
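The matching rule above can be sketched as a greedy one-to-one assignment at the 0.5 IoU threshold; this illustrates the criterion, not the authors' exact evaluation code:

```python
def iou_xywh(a, b):
    """IoU of two (x, y, w, h) boxes with (x, y) the top-left corner."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def match_detections(detections, ground_truth, thresh=0.5):
    """Greedily match each detection to the best remaining ground-truth
    box; count it as a true positive above the IoU threshold, otherwise
    a false positive. Unmatched ground truth counts as missed."""
    unmatched = list(ground_truth)
    tp = fp = 0
    for d in detections:
        best = max(unmatched, key=lambda g: iou_xywh(d, g), default=None)
        if best is not None and iou_xywh(d, best) > thresh:
            tp += 1
            unmatched.remove(best)
        else:
            fp += 1
    return tp, fp, len(unmatched)
```

The counts returned here are the quantities from which criteria such as the missing ratio and false positive rate are computed.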
Based on the experimental results, we can conclude that the proposed detection framework shows good vehicle detection performance. Detection is less effective on the thermal infrared data set because of its lower resolution. Not only do the weighted fusion and band combination data sets improve the accuracy of the vehicle detection to a certain degree, but the decision fusion using the two base image data sets also increases the accuracy. The weighted fusion improved the detection accuracy most when the weight was set to 0.8.
4 Conclusions
In this study, our main aim was to design a small UAV system with two types of cameras to obtain multi-source data. We used a deep learning framework to detect vehicles in the visible and thermal infrared images, which were first corrected and registered, and the results show that the framework is an efficient way to detect vehicles. The proposed framework adopts weighted fusion, band combination, and decision fusion to exploit the multi-source data, and the experimental results show that adding the thermal infrared image data can improve the accuracy of the vehicle detection.
In the future, we will try to obtain extreme weather data, and find more efficient methods to increase the accuracy of the image registration. Moreover, we will explore more possibilities of using multi-source data to further improve the accuracy of vehicle detection.
References
Breckon, T.P., Barnes, S.E., Eichner, M.L., Wahren, K.: Autonomous real-time vehicle detection from a medium-level UAV. In: Proceedings of 24th International Conference on Unmanned Air Vehicle Systems, p. 29–1. sn (2009)
Chen, X., Xiang, S., Liu, C.L., Pan, C.H.: Vehicle detection in satellite images by parallel deep convolutional neural networks. In: 2013 2nd IAPR Asian Conference on Pattern Recognition (ACPR), pp. 181–185. IEEE (2013)
Chen, X., Xiang, S., Liu, C.L., Pan, C.H.: Vehicle detection in satellite images by hybrid deep convolutional neural networks. IEEE Geosci. Remote Sens. Lett. 11(10), 1797–1801 (2014)
Dai, C., Zheng, Y., Li, X.: Layered representation for pedestrian detection and tracking in infrared imagery. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops 2005, CVPR Workshops, p. 13. IEEE (2005)
Gaszczak, A., Breckon, T.P., Han, J.: Real-time people and vehicle detection from UAV imagery (2011)
Gleason, J., Nefian, A.V., Bouyssounousse, X., Fong, T., Bebis, G.: Vehicle detection from aerial imagery. In: 2011 IEEE International Conference on Robotics and Automation (ICRA), pp. 2065–2070. IEEE (2011)
Han, J., Zhang, D., Cheng, G., Guo, L., Ren, J.: Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning. IEEE Trans. Geosci. Remote Sens. 53(6), 3325–3337 (2015)
Redmon, J.: YOLO: real-time object detection. https://pjreddie.com/darknet/yolo/
Kaaniche, K., Champion, B., Pégard, C., Vasseur, P.: A vision algorithm for dynamic detection of moving vehicles with a UAV. In: Proceedings of the 2005 IEEE International Conference on Robotics and Automation, 2005, ICRA 2005, pp. 1878–1883. IEEE (2005)
Li, Z., Liu, Y., Hayward, R., Zhang, J., Cai, J.: Knowledge-based power line detection for UAV surveillance and inspection systems. In: 23rd International Conference on Image and Vision Computing New Zealand 2008, IVCNZ 2008, pp. 1–6. IEEE (2008)
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. arXiv preprint arXiv:1612.08242 (2016)
Rodríguez-Canosa, G.R., Thomas, S., del Cerro, J., Barrientos, A., MacDonald, B.: A real-time method to detect and track moving objects (DATMO) from unmanned aerial vehicles (UAVs) using a single camera. Remote Sens. 4(4), 1090–1111 (2012)
Sun, Z., Bebis, G., Miller, R.: On-road vehicle detection: a review. IEEE Trans. Pattern Anal. Mach. Intell. 28(5), 694–711 (2006)
Turner, D., Lucieer, A., Watson, C.: Development of an unmanned aerial vehicle (UAV) for hyper resolution vineyard mapping based on visible, multispectral, and thermal imagery. In: Proceedings of 34th International Symposium on Remote Sensing of Environment, p. 4 (2011)
Ueshiba, T., Tomita, F.: Plane-based calibration algorithm for multi-camera systems via factorization of homography matrices. In: Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV 2003), p. 966. IEEE (2003)
Zhang, F., Du, B., Zhang, L., Xu, M.: Weakly supervised learning based on coupled convolutional neural networks for aircraft detection. IEEE Trans. Geosci. Remote Sens. 54(9), 5553–5563 (2016)
Zhao, T., Nevatia, R.: Car detection in low resolution aerial images. Image Vis. Comput. 21(8), 693–703 (2003)
© 2017 Springer Nature Singapore Pte Ltd.
Jiang, S., Luo, B., Liu, J., Zhang, Y., Zhang, L. (2017). UAV-Based Vehicle Detection by Multi-source Images. In: Yang, J., et al. Computer Vision. CCCV 2017. Communications in Computer and Information Science, vol 773. Springer, Singapore. https://doi.org/10.1007/978-981-10-7305-2_4
Print ISBN: 978-981-10-7304-5
Online ISBN: 978-981-10-7305-2