
1 Introduction

The task of automatic object detection from satellite and aerial images plays an increasingly important role in several remote sensing applications [4]. In particular, building detection [17] is considered a difficult task, mainly because building areas and their surroundings exhibit high variability in color intensity values and complex attributes [7, 11, 13, 15].

The methods proposed in the literature to solve the building detection task are mainly distinguished by data source (LiDAR [3], aerial images [18], Google Earth images [22], etc.) and classification method (active contours [15], textural features and AdaBoost [2], convolutional neural networks [11, 12], etc.).

This work aims to study the building detection task using RGB images. We use Google Earth (GE) images because they are freely and easily accessible. However, dealing with these images is very challenging, since they show discrepancies, varying brightness, and shadows. Such characteristics occur mainly because the images are acquired from different sources and vary in acquisition date, resolution, and radiometric quality [9]. In particular, this work focuses on GE images from Brazil. Given the dimensions of this country, the GE images present a wide diversity of geographical aspects, in addition to the variation in image quality. These two facts make this particular building detection task much more difficult than with less diverse, higher-quality images.

Several methods to detect buildings in GE images have been proposed recently [8, 9, 11, 22]. In general, methods for solving the building detection task rely on segmentation and feature extraction [13, 20]. Many of these recent works evaluate only a particular region on Earth [14]. Moreover, they still depend on handcrafted features [4]. These facts limit large-scale applications, since they affect generalizability to other domains. This paper focuses on methods that use Deep Learning as a fundamental tool, which provides an alternative approach: effective features are learned automatically from a training set. However, as Zhang et al. (2016) [21] and Karpatne et al. (2016) [13] note, there are still some interesting challenges in the use of Deep Learning to solve remote sensing tasks: (a) how to retain the representation learning performance of Deep Learning methods with fewer adequate training samples, and (b) how to cope with the complexity of remote sensing images, whose large variance in both backgrounds and objects makes it difficult to learn robust and discriminative representations of scenes and objects with Deep Learning.

The U-net, proposed by Ronneberger et al. [19] in 2015, was originally applied to the segmentation of biomedical images with excellent performance. We adapted the U-net to RGB images to make it suitable for the building detection task, considering all the challenges cited above. We compare its results to those of the patch-based convolutional neural network proposed in [11].

The main contribution of this work is a single model with good performance for solving the building detection task in RGB images with large intra-class and background variation, and with large variation in image quality, while requiring only a small set of images for training. A secondary contribution is an annotated dataset composed of 126 images from Brazil, which is publicly available at [1].

The remainder of this paper is organized as follows. Section 2 discusses relevant related work. Section 3 briefly introduces the two CNN architectures tested in this paper. Section 4 presents the dataset. Section 5 describes the test procedure and results. Finally, Sect. 6 concludes this paper by making some final remarks and suggesting future work.

2 Related Work

Since this study is dedicated to automatic building detection on Google Earth images using Deep Learning, we mainly focus on previous work related to building detection on Google Earth images and on the use of Deep Learning in remote sensing tasks.

In recent years, Convolutional Neural Networks (CNNs) have been widely used to solve different remote sensing tasks. Zhang et al. [21] proposed a CNN composed of multiple feature-extraction stages, each built around a convolutional layer. Regarding the building detection task, we find a number of related works. Zhang et al. [22] proposed a deep CNN-based method to detect suburban buildings in GE images from the Yunnan province, China. Guo et al. [11] used supervised machine learning methods, including a CNN model, to identify village buildings using GE images from the Kaysone area, Laos. Also, Kampffmeyer et al. [12] used different deep learning approaches for land cover mapping in urban remote sensing and tested them on a set of aerial images from Vaihingen, Germany.

According to Kampffmeyer et al. [12], there are two main approaches to segmenting images with CNNs. The first one, called the patch-based approach, predicts every pixel in the image by looking at the small enclosing region (patch) of the pixel. The second one, called the pixel-based approach, performs semantic segmentation using end-to-end learning [16], classifying each pixel in the image using a downscaling-upscaling approach. Guo et al.’s work [11] is an example of the patch-based approach, and Kampffmeyer et al.’s [12] is an example of the pixel-based approach.
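To make the contrast concrete, the sketch below illustrates the shape of the two inference strategies; the classifier bodies are mere placeholders, not the models from [11] or [12].

```python
import numpy as np

# Stand-ins for trained models; the thresholding logic is purely illustrative.
def classify_patch(patch):
    """Patch-based: returns one binary label for the whole patch."""
    return int(patch.mean() > 0.5)

def segment_image(image):
    """Pixel-based: returns a dense label map for the whole image."""
    return (image.mean(axis=-1) > 0.5).astype(np.uint8)

image = np.random.rand(900, 900, 3)
P = 18  # patch size

# Patch-based inference: one prediction per enclosing region (patch).
patch_labels = np.zeros((900 // P, 900 // P), dtype=np.uint8)
for i in range(0, 900 - P + 1, P):
    for j in range(0, 900 - P + 1, P):
        patch_labels[i // P, j // P] = classify_patch(image[i:i + P, j:j + P])

# Pixel-based inference: end-to-end, downscaling-upscaling inside the network.
pixel_labels = segment_image(image)  # shape (900, 900)
```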

The main classifiers that use CNNs are patch-based. Recent models, such as AlexNet, ZFNet, GoogLeNet, VGGNet, and ResNet, have been built to improve performance on the ImageNet dataset, which is composed of millions of patches of \(224 \times 224\) pixels associated with thousands of categories [10].

A survey on object detection in optical remote sensing images can be found in [4]. In particular, [17] proposed a survey for building detection. Finally, we recommend [21] for a survey on Deep Learning for remote sensing data.

3 CNN Architecture

This section describes the two CNN architectures that we tested in this work. The first one was proposed in [11], henceforth called the Patch-based Classification Model (PCM). The second one is the U-net proposed in [19].

The PCM is a simple network. It is composed of two convolutional layers, each one followed by a pooling layer with stride 2 for downsampling. Here we include a dense layer with one neuron at the end to perform the classification. The first convolutional layer contains six \(5\times 5\) filters and the second contains twelve \(4\times 4\) filters. The network takes as input an RGB patch of \(18\times 18\) pixels.
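A minimal Keras sketch of this architecture follows; the ReLU activations, sigmoid output, optimizer, and loss are assumptions, since they are not fully specified above.

```python
from tensorflow.keras import layers, models

def build_pcm():
    # Two convolutional layers, each followed by stride-2 pooling,
    # and a single-neuron dense layer for binary classification.
    model = models.Sequential([
        layers.Conv2D(6, (5, 5), activation="relu",
                      input_shape=(18, 18, 3)),        # six 5x5 filters
        layers.MaxPooling2D(pool_size=2, strides=2),   # downsampling
        layers.Conv2D(12, (4, 4), activation="relu"),  # twelve 4x4 filters
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Flatten(),
        layers.Dense(1, activation="sigmoid"),         # "has building" score
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```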

The U-net is composed only of convolutional layers, with no dense layers at the end. It has two paths, a contracting path (left side) and an expansive path (right side). Each path has 4 steps. On the contracting side, each step consists of two sequential applications of a \(3\times 3\) convolution filter followed by a rectified linear unit (ReLU), and then a \(2\times 2\) max-pooling operation for downsampling. On the other side of the U-net, each step of the expansive path is composed of an upsampling layer followed by a \(2\times 2\) convolution, a concatenation with the corresponding cropped feature map from the contracting path, and two \(3\times 3\) convolutions. Finally, a \(1\times 1\) convolution is used at the final layer to map each 64-component feature vector to the desired number of classes. For more details about the U-net, see [19].
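The sketch below condenses this architecture in Keras. We use 'same' padding so feature maps can be concatenated without cropping, and assume inputs padded to a multiple of 16 (e.g. \(576 \times 576\)); both are simplifying assumptions with respect to the valid-padded original in [19].

```python
from tensorflow.keras import layers, models

def conv_block(x, filters):
    # Two sequential 3x3 convolutions, each followed by a ReLU.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(576, 576, 3), n_classes=1):
    inputs = layers.Input(shape=input_shape)
    skips, x, filters = [], inputs, 64
    for _ in range(4):                       # contracting path: 4 steps
        x = conv_block(x, filters)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)        # 2x2 max-pooling
        filters *= 2
    x = conv_block(x, filters)               # bottleneck
    for skip in reversed(skips):             # expansive path: 4 steps
        filters //= 2
        x = layers.UpSampling2D(2)(x)        # upsampling layer
        x = layers.Conv2D(filters, 2, padding="same", activation="relu")(x)
        x = layers.Concatenate()([skip, x])  # merge with contracting path
        x = conv_block(x, filters)
    # The 1x1 convolution maps each 64-component feature vector to the classes.
    outputs = layers.Conv2D(n_classes, 1, activation="sigmoid")(x)
    return models.Model(inputs, outputs)
```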

We selected the CNN model proposed by Guo et al. [11] to compare with the U-net because it is a recent method for building detection on GE images, which is exactly our task.

Fig. 1. Some images from the dataset, illustrating the diversity of urbanistic characteristics and image quality: (a) N.Mutum; (b) Trairí; (c) RJ-Botafogo; (d) RJ-B.Sucesso; (e) RJ-Tijuca; (f) Rio Bonito

4 Dataset

In this work we annotated 126 RGB images, each with approximately \(900 \times 900\) pixels. These images were collected from GE using the Google Maps API. We selected 12 locations distributed across different regions of Brazil. They are hundreds of kilometers apart and represent, in a concise and simplified way, the diversity of image scenarios in this huge country. Figure 1 presents six examples of the collected images. They illustrate the diversity of the urbanistic characteristics in the country and the variability of image quality. To each image in this dataset we manually added a layer that indicates, in binary form, whether each pixel belongs to a building or not. This annotated dataset is available at [1].

We divided the 126 annotated images into two subsets: one with 114 images for training and validation, and the other with one image from each of the 12 selected areas in Brazil (12 images) for testing.

The PCM and the U-net models require input data in different formats. For the PCM, we generated approximately 600,000 patches of \(18\times 18\) pixels from the 126 RGB images and the corresponding classification layer. To each patch we assigned a binary label indicating whether there is a building in that patch or not. For the U-net, we used a moving-window strategy (with a stride of 150 pixels) to extract 772 images of \(572 \times 572\) pixels from the 126 RGB images and their corresponding classification layer.
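The sketch below illustrates both pipelines on a single image-mask pair; the rule that a patch is labeled positive if it contains any building pixel is our assumption, as the exact labeling criterion is not detailed above.

```python
import numpy as np

def extract_patches(image, mask, size=18):
    """Tile the image into size x size patches for the PCM; the binary
    label is 1 if the patch contains any building pixel (assumed rule)."""
    patches, labels = [], []
    for i in range(0, image.shape[0] - size + 1, size):
        for j in range(0, image.shape[1] - size + 1, size):
            patches.append(image[i:i + size, j:j + size])
            labels.append(int(mask[i:i + size, j:j + size].any()))
    return np.array(patches), np.array(labels)

def extract_windows(image, mask, size=572, stride=150):
    """Moving-window crops for the U-net, one mask crop per image crop."""
    crops = []
    for i in range(0, image.shape[0] - size + 1, stride):
        for j in range(0, image.shape[1] - size + 1, stride):
            crops.append((image[i:i + size, j:j + size],
                          mask[i:i + size, j:j + size]))
    return crops

# Illustrative usage with placeholder arrays standing in for a GE image
# and its binary annotation layer.
image = np.random.rand(900, 900, 3)
mask = np.random.randint(0, 2, (900, 900))
patches, labels = extract_patches(image, mask)
windows = extract_windows(image, mask)
```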

5 Test Procedure and Results

We implemented the PCM and the U-net models in Python with Keras [5]. We executed the training process of each model on an Intel i7 CPU with 64 GB of RAM and an NVidia Titan X GPU. Each model was trained for a number of epochs sufficient to reach the best prediction performance without evidence of overfitting; this was assessed by comparing the F1-score on the training and validation datasets. For the PCM model, the training phase took 250 epochs in 2.7 h. For the U-net model, it took 450 epochs in about 45 h.

It is important to record here that we used the image generator functionality from Keras to augment our training examples. For each image in each epoch, we applied a random transformation, not only to prevent overfitting but also to improve model performance [6]. More precisely, we applied vertical flips, horizontal flips, and scaling ranging from 0.9 to 1.

In the results below, the prediction value in the PCM model refers to the \(18\times 18\) patch classification. The prediction result can be evaluated by comparing it with the ground-truth classification layer either patch by patch or pixel by pixel, which results in different performance values. In the training and validation process, we used patch-by-patch comparison for the PCM model and pixel-by-pixel comparison for the U-net model. The evaluation measures were Precision, Recall, and F1-score.
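A sketch of the augmentation setup described above is given below; the batch size and the pairing of image and mask generators through a shared seed (for the U-net case) are assumptions.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Flips plus 0.9-1 scaling, as described in the text.
datagen = ImageDataGenerator(horizontal_flip=True,
                             vertical_flip=True,
                             zoom_range=(0.9, 1.0))

# Placeholder arrays standing in for the 576x576 padded training crops.
image_crops = np.random.rand(8, 576, 576, 3).astype("float32")
mask_crops = np.random.randint(0, 2, (8, 576, 576, 1)).astype("float32")

# The shared seed keeps each random transform identical for image and mask.
image_flow = datagen.flow(image_crops, batch_size=4, seed=42)
mask_flow = datagen.flow(mask_crops, batch_size=4, seed=42)
train_flow = zip(image_flow, mask_flow)
# model.fit(train_flow, steps_per_epoch=len(image_crops) // 4, epochs=450)
```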

Table 1. PCM and U-net model prediction statistics in the test dataset, using pixel by pixel evaluation, in the format mean (standard deviation).

A positive classification means value 1 (“has building”) in the classification layer. Table 1 shows the results for the test dataset, using patch-by-patch and pixel-by-pixel evaluation for the PCM, and pixel-by-pixel evaluation for the U-net. We selected six images from these locations to illustrate the results.
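For reference, a minimal sketch of the pixel-by-pixel evaluation is given below, assuming the prediction and the ground truth are binary arrays of the same shape.

```python
import numpy as np

def precision_recall_f1(pred, truth):
    """Pixel-by-pixel metrics; value 1 means "has building"."""
    tp = np.sum((pred == 1) & (truth == 1))  # true positives
    fp = np.sum((pred == 1) & (truth == 0))  # false positives
    fn = np.sum((pred == 0) & (truth == 1))  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example with placeholder binary maps of the image size used in this work.
pred = np.random.randint(0, 2, (900, 900))
truth = np.random.randint(0, 2, (900, 900))
print(precision_recall_f1(pred, truth))
```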

In the training and validation phases, the U-net model produced better results, with a validation F1-score of 93.9% against 86.6% for the PCM model. In the test phase (see Table 1), the results obtained by the PCM model using pixel-by-pixel evaluation are very similar to those reported by Guo et al. [11], who ran the model on a village in Laos: \(Precision = 41.9\%\); \(Recall = 94.0\%\); F1-score \(= 58.0\%\). However, the PCM model produced worse results than the U-net model. The mean precision with the PCM was 68.1% using patch-by-patch evaluation and 46.5% using pixel-by-pixel evaluation, which means that the PCM made several false positive predictions. Figures 2c and a exemplify the PCM predictions with several false positive areas. On these two images, the PCM model precision is 38.3% and 59.9% (patch-by-patch evaluation), respectively, as can be seen in Table 2.

We notice that when the images have a low density of building areas, the PCM performs worse than the U-net, since it generates several false positives (see Figs. 2a and c for the PCM and Figs. 2b and d for the U-net). The precision of the PCM model for the images in Figs. 2g and e is 71.3% and 57.0%, respectively, using pixel-by-pixel evaluation. In these images, which have high-density building areas, one can observe that the PCM model predicted all streets as building areas. The relatively high precision in RJ-B.Sucesso-PCM is due to the high density of buildings in the image. In these test images, we could observe that the errors were probably due to the PCM model classifying an \(18\times 18\) patch as a whole. As a patch may only partially include building areas, the PCM model generates false positives, wrongly classifying streets and green areas as buildings. The image of RJ-Botafogo (Fig. 1c) shows a region with tall buildings, parking areas, and shadows. These artifacts generated false positive areas in the results of both algorithms (see Figs. 2e and f). In images with a medium level of building density, for example in residential areas, the PCM model (see Figs. 2i and k) still presents a high incidence of false positives. Moreover, the brightness contrast in these images also generates false positive and false negative areas. Both the PCM and the U-net generate more false negatives in images with high brightness.

Table 2. Aggregated prediction statistics for 12 locations in the test dataset.
Fig. 2. Prediction output in areas of different building density. Green pixels represent True Positives, red pixels represent False Positives, and yellow pixels represent False Negatives: (a) N.Mutum-PCM; (b) N.Mutum-U-Net; (c) Trairí-PCM; (d) Trairí-U-Net; (e) RJ-Botafogo-PCM; (f) RJ-Botafogo-U-Net; (g) RJ-B.Sucesso-PCM; (h) RJ-B.Sucesso-U-Net; (i) RJ-Tijuca-PCM; (j) RJ-Tijuca-U-Net; (k) Rio Bonito-PCM; (l) Rio Bonito-U-Net (Color figure online)

We observe in Table 1 that the U-net shows a good mean F1-score of 82.6%, with a low standard deviation of 6.6%, considering all images in the test dataset. Figures 2b and h illustrate the good precision of the U-net on these test images. However, we notice that the lowest U-net recall was 66.8%, for the Nova Mutum area (Fig. 2b). This indicates a high occurrence of false negatives, which are mainly concentrated on the object boundaries. We can also observe in the images in Fig. 2 that streets and green areas are well identified. Incorrect classifications usually occurred in areas with a high concentration of buildings (Figs. 2e and f), due to the higher occurrence of shadows in these cases. In the predictions by the PCM (Figs. 2g and i), however, we verified a great incidence of false positives on the streets. Given the high density of constructions, this difference is not evident in the statistics in Table 2. In these examples, the U-net F1-scores vary between 74.2% and 93.3%, while the PCM pixel-by-pixel F1-scores vary between 28.9% and 79.1%. This indicates that the U-net model is quite robust to the high variability of building characteristics, the different region backgrounds, and the different levels of image quality.

6 Conclusions

In this work we found a single model with good performance in solving the building detection task. This is relevant considering that we used one model to perform the task on a dataset with large intra-class and background variation, and also with a large variation in image quality. Moreover, the results were obtained using a small set of training data. This annotated dataset is also a contribution of this work. Thus, we can conclude that the U-net adapted to RGB images is a very promising network architecture to deal with the challenges of using Deep Learning to solve building detection tasks. In the future, we plan to increase the dataset to include more regions in order to achieve better generalization and precision.