1 Introduction

Object detection is a fundamental problem in computer vision and a key step in many real-world applications, including image retrieval, intelligent surveillance, and autonomous driving. It has been studied extensively over the past few decades, and huge progress has been made with the emergence of deep convolutional neural networks (CNNs). Currently, there are two main frameworks for CNN-based object detection: (i) the one-stage framework, such as YOLO [27] and SSD [24], which applies an object classifier and regressor in a dense manner without objectness pruning; and (ii) the two-stage framework, such as Faster-RCNN [29], RFCN [3], and FPN [22], which extracts object proposals and then classifies and regresses each proposal.

Fig. 1. The overall error analysis of the FPN detector [22] over all categories on the large, medium, and small subsets of the COCO dataset [23], respectively. The plots in each sub-figure are a series of precision-recall curves under the different evaluation settings defined in [23]. The comparison shows a large gap between the performance on small objects and that on large/medium sized objects.

Object detectors of both frameworks have achieved impressive results on objects of large/medium size on large-scale detection benchmarks (e.g. the COCO dataset [23]), as shown in Fig. 1(a) and (b). However, the performance on small objects (defined as in [23]) is far from satisfactory, as shown in Fig. 1(c): there is a large gap between the performance on small and on large/medium sized objects. The main difficulty for small object detection (SOD) is that small objects lack the appearance information needed to distinguish them from the background (or from similar categories) and to achieve accurate localization. To improve detection performance on small objects, SSD [24] exploits the intermediate conv feature maps to represent them. However, these shallow fine-grained conv feature maps are less discriminative, which leads to many false positives. FPN [22], on the other hand, uses a feature pyramid to represent objects at different scales, in which low-resolution feature maps with strong semantic information are up-sampled and fused with high-resolution feature maps with weak semantic information. However, up-sampling can generate artifacts, which degrade detection performance.

To deal with the SOD problem, we propose a unified end-to-end convolutional neural network based on the classical generative adversarial network (GAN) framework, which can be incorporated into any existing detector. Following the structure of the seminal GAN work [9, 21], our model has two sub-networks: a generator and a discriminator. In the generator, a super-resolution network (SRN) up-samples a small object image to a larger scale. Compared to directly resizing the image with bilinear interpolation, the SRN generates higher-quality images with fewer artifacts at large up-scaling factors (\(4\times \) in our current implementation). In the discriminator, we introduce classification and regression branches for the task of object detection. The real and generated super-resolved images pass through the discriminator network, which jointly distinguishes whether they are real or generated high-resolution images, determines which classes they belong to, and refines the predicted bounding boxes. More importantly, the classification and regression losses are back-propagated to the generator, encouraging it to produce higher-quality images for easier classification and better localization.

Contributions. This paper makes the following three main contributions. (1) A novel unified end-to-end multi-task generative adversarial network (MTGAN) for small object detection is proposed, which can be incorporated into any existing detector. (2) In the MTGAN, the generator network produces super-resolved images, and a multi-task discriminator network is introduced to distinguish real high-resolution images from fake ones, predict object categories, and refine bounding boxes, simultaneously. More importantly, the classification and regression losses are back-propagated to further guide the generator network to produce super-resolved images for easier classification and better localization. (3) Finally, we demonstrate the effectiveness of MTGAN within the object detection pipeline, where detection performance improves substantially over several state-of-the-art baseline detectors, primarily for small objects.

2 Related Work

2.1 General Object Detection

As a classic topic, numerous object detection systems have been proposed during the past decade or so. Traditional object detection methods are based on handcrafted features and the deformable part model (DPM). Due to the limited representation of handcrafted features, traditional object detectors register subpar performance, particularly on small sized objects.

In recent years, superior performance in image classification and scene recognition has been achieved with the resurgence of deep neural networks including CNNs [19, 32, 34]. Similarly, the performance of object detection has been significantly boosted due to richer appearance and spatial representations, which are learned by CNNs [7] from large-scale image datasets. Currently, a CNN-based object detector can be categorized as belonging to one of two frameworks: the two-stage framework and the one-stage framework. The region-based CNN (RCNN) [7] can be considered a milestone of the two-stage framework for object detection, and it achieved state-of-the-art detection performance. Each region proposal is processed separately in RCNN [7], which is very time-consuming. After that, ROI-Pooling was introduced in Fast-RCNN [6], which shares computation between the proposal extraction and classification steps, thus improving efficiency greatly. By learning both stages end-to-end, Faster RCNN [29] registered further improvement in both detection performance and computational efficiency. However, all detectors of this framework show unsatisfactory performance on small objects in the COCO benchmark, since they have no explicit strategy to deal with such objects. To detect small objects better, FPN [22] combines the low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections, in which the learned conv feature maps are expected to contain strong semantic information for small objects. Because of this, FPN shows superior performance over Faster RCNN for the task of detecting small objects. However, the low-resolution feature maps in FPN are up-sampled to create the feature pyramid, a process which tends to introduce artifacts into the features and consequently degrades detection performance. Compared to FPN, our proposed method employs a super-resolution network to generate high-resolution images (4\(\times \) up-scaling) from low-resolution ones, thus avoiding the artifact problem caused by FPN's up-sampling operator.

In the one-stage framework, the detector directly classifies anchors into specific classes and regresses bounding boxes in a dense manner. For example, in SSD [24] (a typical one-stage detector), the low-level intermediate conv feature maps of high resolution are used to detect small objects. However, these conv features usually capture only basic visual patterns devoid of strong semantic information, which may lead to many false positives. Compared to SSD-like detectors, our discriminator uses deep, semantically strong features to better represent small objects, thus reducing the false positive rate.

2.2 Generative Adversarial Networks

In the seminal work [9], the generative adversarial network (GAN) is introduced to generate realistic-looking images from random noise inputs. GANs have achieved impressive results in image generation [4], image editing [35], representation learning [25], image super-resolution [21] and style transfer [16]. Recently, GANs have been successfully applied to super-resolution (SRGAN) [21], leading to impressive and promising results. Compared to super-resolution on generic natural images, images of specific objects (e.g. in the COCO benchmark) exhibit far more diversity in blur, pose, and illumination, which makes super-resolution on these images much more challenging. In fact, the super-resolved images generated by SRGAN are blurry, especially for low-resolution small objects, which is not helpful for training an accurate object classifier. To alleviate this problem, we introduce novel losses into the loss function of the generator: the classification and regression losses are back-propagated to the generator network in our proposed MTGAN, which further guides the generator to reconstruct finer super-resolved images for easier classification and better localization.

3 MTGAN for Small Object Detection

In this section, we introduce the proposed method in detail. First, we give a brief description of the classical GAN network to lay the context for describing our proposed Multi-Task GAN (MTGAN) for small object detection. Then, the whole architecture of our framework is described (refer to Fig. 2 for an illustration). Finally, we present each part of our MTGAN network and define the loss functions for training the generator and discriminator, respectively.

Fig. 2. The pipeline of the proposed small object detection system (SOD-MTGAN). (A) Images are fed into the network. (B) The baseline detector can be any type of detector (e.g. Faster RCNN [29], FPN [22], or SSD [24]); it crops positive (i.e. object) and negative (i.e. background) examples from input images for training the generator and discriminator networks, or generates regions of interest (ROIs) for testing. (C) The positive and negative examples (or ROIs) generated by off-the-shelf detectors. (D) The generator sub-network reconstructs a super-resolved version (\(4\times \) up-scaling) of the low-resolution input image. (E) The discriminator network distinguishes real from generated high-resolution images, predicts the object categories, and regresses the object locations, simultaneously. The discriminator network can use any typical architecture, such as AlexNet [20], VGGNet [32], or ResNet [12], as the backbone network; we use ResNet-50 or ResNet-101 in our experiments.

3.1 GAN

GAN [9] learns a generator network G and a discriminator network D simultaneously via an adversarial process. The training process alternately optimizes the generator and discriminator, which are in competition with each other. The generator G is trained to produce samples to fool the discriminator D, and D is trained to distinguish real from fake images produced by G. The GAN loss to be optimized is defined as follows:

$$\begin{aligned} \mathcal {L}_{GAN}(G_{w},D_{\theta })=\mathbb {E}_{x\sim p_{data}(x)}[\log D_{\theta }(x)]+\mathbb {E}_{z\sim p_{z}(z)}[\log (1-D_{\theta }(G_{w}(z)))] \end{aligned}$$
(1)

where z is random noise and x denotes the real data; \(\theta \) and w denote the parameters of D and G, respectively. Here, G tries to minimize the objective function, while D tries to maximize it, as in Eq. (2):

$$\begin{aligned} \min _{w}\max _{\theta }\; \mathcal {L}_{GAN}(G_{w},D_{\theta }) \end{aligned}$$
(2)

Similar to [9, 21], we design a generator network \(G_{w}\) that is optimized in an alternating manner with the discriminator network \(D_{\theta }\), so as to jointly solve the super-resolution, object classification, and bounding-box regression problems for small object detection. The overall loss is therefore defined as follows:

$$\begin{aligned} \min _{w}\max _{\theta }\; \mathbb {E}_{(I^{HR},u,v)\sim p_{train}}\big [\log D_{\theta }(I^{HR})\big ]+\mathbb {E}_{(I^{LR},u,v)\sim p_{G}}\big [\log (1-D_{\theta }(G_{w}(I^{LR})))\big ] \end{aligned}$$
(3)

where \(I^{LR}\) and \(I^{HR}\) denote low-resolution and high-resolution images, respectively, u is the class label, and v is the ground-truth bounding-box regression target. Unlike [9], the input of our generator is a low-resolution image rather than random noise. Compared to [21], our discriminator is multi-task: it distinguishes generated super-resolved images from real high-resolution images, classifies the object category, and regresses the object location jointly. Specifically, the general idea behind Eq. (3) is that it allows one to train a generator G with the goal of fooling a differentiable discriminator D that is trained to distinguish super-resolved images from real high-resolution images. Furthermore, our method (SOD-MTGAN) extends the classical SRGAN [21] by adding two parallel branches to classify the categories and regress the bounding boxes of candidate ROI images. Moreover, the classification and regression losses in the discriminator are back-propagated to the generator to further push it to produce super-resolved images that allow easier classification and better localization. In the following subsections, we introduce the architecture of the MTGAN and the training losses in detail.

Table 1. The architecture of the generator and discriminator networks. “conv” and “layer*” represent convolutional layers, “x5” denotes that the residual block is repeated five times, “de-conv” means an up-sampling convolutional layer, “2x” denotes up-sampling by a factor of 2, and “fc” indicates a fully connected layer. Note that we only show the architecture of the discriminator network with ResNet-50.

3.2 Network Architecture

Our generator takes low-resolution images as input, instead of random noise, and outputs super-resolved images. For the purpose of object detection, the discriminator is designed to distinguish generated super-resolved images from real high-resolution images, classify the object categories, and regress the location jointly.

Generator Network (\(G_{w}\)). As shown in Table 1 and Fig. 2, we adopt a deep CNN architecture which has shown effectiveness for image de-blurring in [13] and face detection in [1]. Different from [13], our generator includes up-sampling layers (i.e. de-conv in Table 1). There are two up-sampling fractionally-strided conv layers, three conv layers, and five residual blocks in the network. In each residual block, we use two conv layers with 3 \(\times \) 3 kernels and 64 feature maps, followed by batch-normalization layers [15] and parametric ReLU [11] as the activation function. Each de-convolutional layer consists of learned kernels that up-sample the input by a factor of 2, which usually performs better than re-sizing the same image with a fixed interpolation method [5, 17, 33].

Our generator first up-samples low-resolution small images, which include both object and background candidate ROI images, to 4\(\times \) super-resolved images via the de-convolutional layers, and then performs convolution to produce corresponding clear images. The outputs of the generator (clear super-resolved images) are easier for the discriminator to classify as fake or real and to perform object detection (i.e. object classification and bounding-box regression).
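To make this concrete, the following is a minimal PyTorch sketch of such a generator, assuming the layout described above (a head conv, five residual blocks of two 3 \(\times \) 3/64 convs with batch-norm and PReLU, two 2x de-conv layers for 4x overall, and a tail conv). Kernel sizes and padding are assumptions where the text does not pin them down; this is an illustration, not the released implementation.

```python
# Sketch of the generator under the assumptions stated above.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.PReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)  # skip connection

class Generator(nn.Module):
    def __init__(self, num_blocks=5):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.PReLU())
        self.blocks = nn.Sequential(*[ResidualBlock() for _ in range(num_blocks)])
        # Two fractionally-strided (de-conv) layers, each up-sampling by 2x.
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.PReLU(),
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.PReLU(),
        )
        self.tail = nn.Conv2d(64, 3, 3, padding=1)

    def forward(self, lr_image):
        x = self.head(lr_image)
        x = self.blocks(x)
        x = self.upsample(x)  # 4x larger spatially
        return self.tail(x)   # super-resolved RGB image

# Example: a 16x16 ROI becomes a 64x64 super-resolved image.
# sr = Generator()(torch.randn(1, 3, 16, 16))  # -> (1, 3, 64, 64)
```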

Discriminator Network (\(D_{\theta }\)). We employ ResNet-50 or ResNet-101 [12] as our backbone network in the discriminator, and Table 1 shows the architecture of the ResNet-50 network. We add three parallel fc layers behind the last average pooling layer of the backbone network, which play the role of distinguishing the real high-resolution images from the generated super-resolved images, classifying object categories, and regressing bounding boxes, respectively. For this specific task, the first fc layer (called \(fc_{GAN}\)) uses a sigmoid loss function [26], while the classification fc layer (called \(fc_{cls}\)) and regression fc layer (called \(fc_{reg}\)) use the softmax and smooth L1 loss [6] functions, respectively.

The input of the discriminator is a high-resolution ROI image. The output of the \(fc_{GAN}\) branch is the probability \(p_{GAN}\) that the input image is real; the output of the \(fc_{cls}\) branch is the probability distribution \(p_{cls}=(p_0, ..., p_K)\) over the \(K+1\) object categories; and the output of the \(fc_{reg}\) branch is the bounding-box regression offsets \(t=(t_x, t_y, t_w, t_h)\) for the ROI candidate.
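A corresponding sketch of the multi-task discriminator, assuming a torchvision ResNet-50 backbone. The 4-dimensional regression head (one box per ROI, shared across classes) is an assumption; the text only specifies \(t=(t_x, t_y, t_w, t_h)\).

```python
# Sketch of the discriminator: ResNet-50 trunk + three parallel fc heads.
import torch
import torch.nn as nn
import torchvision

class Discriminator(nn.Module):
    def __init__(self, num_classes):  # num_classes = K + 1, incl. background
        super().__init__()
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        # Keep everything up to and including the global average pooling.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        feat_dim = resnet.fc.in_features  # 2048 for ResNet-50
        self.fc_gan = nn.Linear(feat_dim, 1)            # real vs. generated
        self.fc_cls = nn.Linear(feat_dim, num_classes)  # K+1 categories
        self.fc_reg = nn.Linear(feat_dim, 4)            # (tx, ty, tw, th)

    def forward(self, hr_image):
        feat = self.backbone(hr_image).flatten(1)
        p_gan = torch.sigmoid(self.fc_gan(feat))  # prob. the image is real
        cls_logits = self.fc_cls(feat)            # softmax applied in the loss
        t = self.fc_reg(feat)                     # box regression offsets
        return p_gan, cls_logits, t
```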

3.3 Overall Loss Function

We adopt the pixel-wise and adversarial losses from some state-of-the-art GAN approaches [16, 21] to optimize our generator. In contrast to [21], we remove the feature matching loss to decrease the computational complexity without sacrificing much in generation performance. Furthermore, we introduce the classification and regression losses into the generator objective function to drive the generator network to recover fine details from small scale images for easier detection.

Pixel-wise Loss. The input of our generator network consists of small ROI images instead of random noise [9]. A natural and simple way to enforce that the output of the generator (i.e. the super-resolved image) be close to the ground-truth image is to minimize the pixel-wise MSE loss, computed as in Eq. (4):

$$\begin{aligned} \mathcal {L}_{MSE}(w)=\frac{1}{N}\sum _{i=1}^N\Vert G_{w}(I^{LR}_i)-I^{HR}_i\Vert ^2 \end{aligned}$$
(4)

where \(I_i^{LR}\), \(G_{w}(I^{LR}_i)\) and \(I_i^{HR}\) denote small low-resolution images, generated super-resolved images, and real high-resolution images, respectively. G represents the generator network, and w denotes its parameters. However, it is known that the solution to the MSE optimization problem usually lacks high-frequency content, which results in blurred images with overly smooth texture.

Adversarial Loss. To achieve more realistic results, we introduce the adversarial loss [21] to the objective loss, defined as Eq. (5):

$$\begin{aligned} \mathcal {L}_{adv}=\frac{1}{N}\sum _{i=1}^N\log (1-D_{\theta }(G_{w}(I^{LR}_i))) \end{aligned}$$
(5)

The adversarial loss encourages the network to generate sharper high-frequency details so as to fool the discriminator D. In Eq. (5), \(D_\theta (G_w(I^{LR}_i))\) denotes the probability of the resolved image \(G_w(I^{LR}_i)\) being a real high-resolution image.

Classification Loss. In order to complete the task of object detection and to make the generated images easier to classify, we introduce the classification loss to the overall objective. Let \(\{I^{LR}_i, i{ = }1,2,\ldots ,N\}\) and \(\{I^{HR}_i, i{ = }1,2,\ldots ,N\}\) denote low-resolution images and real high-resolution images respectively, and \(\{u_i, i{ = }1,2,\ldots ,N\}\) represent their corresponding labels, where \(u_i \in \{0, ..., K\}\) indicates the object category. As such, we formulate the classification loss as:

$$\begin{aligned} \mathcal {L}_{cls}=-\frac{1}{N}\sum _{i=1}^N\big (\log p_{I_i^{LR}}+\log p_{I_i^{HR}}\big ) \end{aligned}$$
(6)

where \(p_{I_i^{LR}}=D_{cls}(G_w(I^{LR}_i))\) and \(p_{I_i^{HR}}=D_{cls}(I^{HR}_i)\) denote the probabilities of the generated super-resolved image and the real high-resolution image belonging to the true category \(u_i\), respectively.

In our method, the classification loss plays two roles. First, it guides the discriminator to learn a classifier that predicts the object categories of high-resolution images, whether generated super-resolved or real. Second, it encourages the generator to recover sharper images for easier classification.

Regression Loss. To enable more accurate localization, we also introduce a bounding box regression loss [6] to the objective function, defined in Eq. (7):

$$\begin{aligned} \mathcal {L}_{reg}=\frac{1}{N}\sum _{i=1}^N\sum _{k\in {\{HR,SR\}}}[u_i\ge 1]\sum _{j\in {\{x,y,w,h\}}}\mathrm{S}_{L_1}(t_{i,j}^{k}-v_{i,j}) \end{aligned}$$
(7)

in which,

$$\begin{aligned} \mathrm{S}_{L_1}(x)={\left\{ \begin{array}{ll} 0.5x^2, &{} \text {if } |x|<1\\ |x|-0.5, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(8)

where \(v_i = (v_{i,x}, v_{i,y}, v_{i,w}, v_{i,h})\) denotes a tuple of the true bounding-box regression target, and \(t_i = (t_{i,x}, t_{i,y}, t_{i,w}, t_{i,h})\) denotes the predicted regression tuple. \(t_i^{HR}\) and \(t_i^{SR}\) denote the tuples for the i-th real high-resolution and generated super-resolved images, respectively. The bracket indicator function \([u_i \ge 1]\) equals 1 when \(u_i \ge 1\) and 0 otherwise. For a more detailed description of the regression loss, we refer the reader to [6].

Similar to the classification loss, our regression loss also has two purposes. First, it encourages the discriminator to regress the location of the object candidates cropped from the baseline detector. Second, it promotes the generator to produce super-resolved images with fine details for more accurate localization.
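For concreteness, the four losses can be sketched in PyTorch as follows, reusing the generator/discriminator interfaces sketched in Sect. 3.2; `F.smooth_l1_loss` plays the role of Eq. (8). Only the super-resolved (SR) terms of Eqs. (6) and (7) are shown, as these are the terms needed for the generator update; the high-resolution (HR) terms are analogous.

```python
# Sketch of the training losses, assuming G and D from the Sect. 3.2 sketches.
import torch
import torch.nn.functional as F

def generator_losses(G, D, lr_imgs, hr_imgs, labels, reg_targets):
    sr_imgs = G(lr_imgs)
    p_gan, cls_logits, t_sr = D(sr_imgs)

    mse_loss = F.mse_loss(sr_imgs, hr_imgs)           # Eq. (4)
    adv_loss = torch.log(1 - p_gan + 1e-8).mean()     # Eq. (5)
    cls_loss = F.cross_entropy(cls_logits, labels)    # Eq. (6), SR term
    fg = labels >= 1                                  # indicator [u_i >= 1]
    reg_loss = (F.smooth_l1_loss(t_sr[fg], reg_targets[fg])  # Eq. (7), SR term
                if fg.any() else sr_imgs.new_zeros(()))
    return mse_loss, adv_loss, cls_loss, reg_loss
```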

Objective Function. Based on the above analysis, we combine the adversarial loss in Eq. (5), classification loss in Eq. (6) and regression loss in Eq. (7) with the pixel-wise MSE loss in Eq. (4). As such, our GAN network can be trained by optimizing the objective function in Eq. (9):

$$\begin{aligned} \min _{w}\max _{\theta }\; \mathcal {L}_{MSE}(w)+\alpha \mathcal {L}_{adv}(w,\theta )+\beta \mathcal {L}_{cls}(w,\theta )+\gamma \mathcal {L}_{reg}(w,\theta ) \end{aligned}$$
(9)

where \(\alpha \), \(\beta \), and \(\gamma \) are weights trading off the different terms. These weights are cross-validated in our experiments.

Directly optimizing Eq. (9) with respect to w to update the generator G makes w diverge rapidly, since a large w always yields a large objective value. For better behavior, we optimize the objective function in a fixed-point manner, as done in previous GAN work [16, 21]. Specifically, we optimize the parameters w of the generator G while keeping the discriminator D fixed, and then update its parameters \(\theta \) while keeping the generator fixed. The resulting two sub-problems are iteratively optimized as:

$$\begin{aligned} \begin{aligned} \displaystyle \min _w\quad&\frac{1}{N}\sum _{i=1}^N(\alpha \log (1-D_\theta (G_w(I^{LR}_i)))-\beta \log (D_{cls}(G_w(I^{LR}_i))))\\&+ \frac{1}{N}\sum _{i=1}^N \gamma \sum _{j\in {\{x,y,w,h\}}}[u_i\ge 1]\mathrm{S}_{L_1}(t_{i,j}^{SR}-v_{i,j})+\frac{1}{N}\sum _{i=1}^N\Vert G_{w}(I^{LR}_i)-I^{HR}_i\Vert ^2 \end{aligned} \end{aligned}$$
(10)

and

$$\begin{aligned} \begin{aligned} \displaystyle \max _\theta \quad&\frac{1}{N}\sum _{i=1}^N\Big (\alpha \big (\log D_\theta (I^{HR}_i)+\log (1-D_\theta (G_w(I^{LR}_i)))\big )+\beta \big (\log p_{I_i^{HR}}+\log p_{I_i^{LR}}\big )\Big )\\&-\frac{1}{N}\sum _{i=1}^N \gamma \sum _{k\in {\{HR,SR\}}}[u_i\ge 1]\sum _{j\in {\{x,y,w,h\}}}\mathrm{S}_{L_1}(t_{i,j}^{k}-v_{i,j}) \end{aligned} \end{aligned}$$
(11)

The loss function of the generator G in Eq. (10) consists of the adversarial loss in Eq. (5), the MSE loss in Eq. (4), the classification loss in Eq. (6), and the regression loss in Eq. (7), which together enforce that the reconstructed images resemble real, class-identifiable, and accurately localizable high-resolution images with high-frequency details. Compared to previous GANs, we add the classification and regression losses on the generated super-resolved object images to the generator loss. With these two losses, the super-resolved images recovered by the generator network are more realistic than those optimized using only the adversarial and MSE losses.

The loss function of the discriminator D in Eq. (11) introduces the classification loss in Eq. (6) and the regression loss in Eq. (7). The classification loss serves to classify the categories of the real high-resolution and generated super-resolved images, in parallel with the basic GAN formulation [9] that distinguishes real from generated high-resolution images. In small object detection, a drift of only a few pixels can make the predicted bounding boxes fail the evaluation criteria. Therefore, we introduce the regression loss (regression branch) into the discriminator network for better localization.
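The alternating optimization of Eqs. (10) and (11) can be sketched as follows, assuming `G`, `D`, and `generator_losses` from the sketches above, plus optimizers `opt_g`, `opt_d` and a `data_loader` yielding ROI batches; all of these names are illustrative.

```python
# Sketch of the fixed-point alternation: update G with D frozen, then D with
# G frozen. Trade-off weights follow Sect. 4.2.
import torch
import torch.nn.functional as F

alpha, beta, gamma = 0.001, 0.01, 0.01
bce = torch.nn.BCELoss()

for lr_imgs, hr_imgs, labels, reg_targets in data_loader:
    # Generator step (Eq. 10): MSE + adversarial + classification + regression.
    mse, adv, cls, reg = generator_losses(G, D, lr_imgs, hr_imgs, labels, reg_targets)
    g_loss = mse + alpha * adv + beta * cls + gamma * reg
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    # Discriminator step (Eq. 11): real/fake, category, and box terms on both
    # real HR images and generated SR images.
    with torch.no_grad():
        sr_imgs = G(lr_imgs)
    p_real, cls_real, t_real = D(hr_imgs)
    p_fake, cls_fake, t_fake = D(sr_imgs)
    d_adv = bce(p_real, torch.ones_like(p_real)) + bce(p_fake, torch.zeros_like(p_fake))
    d_cls = F.cross_entropy(cls_real, labels) + F.cross_entropy(cls_fake, labels)
    fg = labels >= 1
    d_reg = (F.smooth_l1_loss(t_real[fg], reg_targets[fg]) +
             F.smooth_l1_loss(t_fake[fg], reg_targets[fg])) if fg.any() else 0.0
    d_loss = alpha * d_adv + beta * d_cls + gamma * d_reg
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
```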

4 Experiments

In this section, we validate our proposed SOD-MTGAN detector on a challenging public object detection benchmark (i.e. the COCO dataset [23]), including ablation studies and comparisons against other state-of-the-art detectors.

4.1 Training and Validation Datasets

We use the COCO dataset [23] for all experiments. As stated in [23], there are more small objects than large/medium objects in the dataset: approximately 41% of objects are small (\(area < 32^2\)). Therefore, we use this dataset for training and validating the proposed method. For the object detection task, there are 125k images taken in natural settings and of everyday life (i.e. objects with much diversity); 80k/40k/5k of the images are used for training, validation, and testing, respectively. Following previous works [2, 22], we use the union of the 80k training images and a 35k subset of the validation images (trainval35k) for training, and report ablation results on the remaining 5k validation images (minival).

During evaluation, the COCO dataset is divided into three subsets (small, medium, and large) based on object area. The small subset contains objects with area smaller than \(32^2\) pixels, the medium subset objects with area between \(32^2\) and \(96^2\) pixels, and the large subset objects with area larger than \(96^2\) pixels. In this paper, we focus on small object detection using our proposed MTGAN network. We report the final detection performance using the standard COCO metrics, which include \(\mathrm{AP}\) (averaged over IoU thresholds [0.5:0.05:0.95]), \(\mathrm{AP}_{50}\), \(\mathrm{AP}_{75}\), and \(\mathrm{AP}_S\), \(\mathrm{AP}_M\), \(\mathrm{AP}_L\) (AP at different scales).
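As a sketch, these metrics can be computed with the official pycocotools API once detections are exported in the standard COCO results-JSON format; the file paths below are placeholders.

```python
# Sketch of computing the standard COCO metrics with pycocotools.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_minival.json")    # placeholder path
coco_dt = coco_gt.loadRes("sod_mtgan_detections.json")  # placeholder path
ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate(); ev.accumulate(); ev.summarize()
# ev.stats[:6] = [AP, AP50, AP75, AP_S, AP_M, AP_L]; the S/M/L splits use
# the 32^2 and 96^2 area thresholds described above.
```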

Table 2. The detection performance (AP) of our proposed SOD-MTGAN method against the baseline methods on the COCO minival subset. The AP performance of Faster RCNN [29] and Mask-RCNN [10] is provided by [8]. SOD-MTGAN clearly outperforms the baseline methods, especially on the small subset, where the AP increases by more than 1.5%.

4.2 Implementation Details

In the generator network, we set the trade-off weights to \(\alpha = 0.001\) and \(\beta = \gamma = 0.01\). The generator network is trained from scratch: the weights in each layer are initialized from a zero-mean Gaussian distribution with standard deviation 0.02, and the biases are initialized to 0. To avoid undesirable local optima, we first train an MSE-based SR network to initialize the generator network. For the discriminator network, we employ a ResNet-50 or ResNet-101 [12] model pre-trained on ImageNet as our backbone network and add three parallel fc layers as described in Sect. 3.2. The fc layers are initialized from a zero-mean Gaussian distribution with standard deviation 0.1, with biases initialized to 0.
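A sketch of this initialization scheme, assuming the module names from the Sect. 3.2 sketches (`G`, `D.fc_gan`, `D.fc_cls`, `D.fc_reg` are illustrative names):

```python
# Sketch of the weight initialization described above.
import torch.nn as nn

def init_generator(m):
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)  # zero-mean Gaussian
        if m.bias is not None:
            nn.init.zeros_(m.bias)

G.apply(init_generator)  # generator is trained from scratch
for head in (D.fc_gan, D.fc_cls, D.fc_reg):  # backbone keeps ImageNet weights
    nn.init.normal_(head.weight, mean=0.0, std=0.1)
    nn.init.zeros_(head.bias)
```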

Our baseline detectors are based on Faster RCNN with ResNet50-C4 [12] and FPN with ResNet101 [22]. All hyper-parameters of the baseline detectors are adopted from the setup in [10]. For training our generator and discriminator networks, we crop positive and negative ROI examples from the COCO [23] trainval35k set with our baseline detectors. The corresponding low-resolution images are generated by down-sampling the high-resolution images using bicubic interpolation with a factor of 4. During testing, 100 ROIs are cropped by our baseline detector and then fed to our MTGAN network to produce the final detections.
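Constructing the (low-resolution, high-resolution) training pairs then amounts to bicubic 4x down-sampling of each cropped ROI; a minimal sketch with Pillow:

```python
# Sketch of building an (LR, HR) training pair from one cropped ROI.
from PIL import Image

def make_lr_hr_pair(hr_roi: Image.Image):
    w, h = hr_roi.size
    # Image.BICUBIC is Image.Resampling.BICUBIC in newer Pillow versions.
    lr_roi = hr_roi.resize((max(1, w // 4), max(1, h // 4)), Image.BICUBIC)
    return lr_roi, hr_roi
```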

During training, we use the Adam optimizer [18] for the generator and the SGD optimizer for the discriminator network. The learning rate for SGD is initially set to 0.01 and then reduced by a factor of 10 after every 40k mini-batches. Training is terminated after a maximum of 80k iterations. We alternately update the generator and discriminator network as in [9]. Our system is implemented in PyTorch, and the source code will be made publicly available.
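The optimizer setup can be sketched as follows; the Adam learning rate and the SGD momentum are assumptions, as the text specifies only the SGD learning rate and its schedule.

```python
# Sketch of the optimizer setup: Adam for G, SGD for D with the learning
# rate divided by 10 every 40k mini-batches.
import torch

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)               # lr assumed
opt_d = torch.optim.SGD(D.parameters(), lr=0.01, momentum=0.9)  # momentum assumed
sched_d = torch.optim.lr_scheduler.StepLR(opt_d, step_size=40000, gamma=0.1)
# Call sched_d.step() once per mini-batch; stop after at most 80k iterations.
```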

4.3 Ablation Studies

We first compare our proposed method with the baseline detectors to prove the effectiveness of the MTGAN for small object detection. Moreover, we verify the positive influence of the regression branch in the discriminator network by comparing the AP performance with/without this branch. Finally, to validate the contribution of each loss (adversarial, classification, and regression) in the loss function of the generator, we also conduct ablation studies by gradually adding each of them to the pixel-wise MSE loss. Unless otherwise stated, all the ablation studies use the ResNet-50 as the backbone network in the discriminator.

Influence of the Multi-task GAN (MTGAN). Table 2 (the \(2^{nd}\) vs. \(3^{rd}\) row and the \(4^{th}\) vs. \(5^{th}\) row) compares the performance of the baseline detectors against our method on the COCO minival subset. From Table 2, we observe that the performance of our MTGAN with ResNet-50 outperforms Faster-RCNN (the ResNet-50-C4 detector) by a sizable margin (i.e. 1.5% in AP) on the small subset. Similarly, MTGAN with ResNet-101 improves over the FPN detector with ResNet-101 by 1.6% in AP. The reason is that the baseline detectors perform the down-sampling operations (i.e. convolution with stride 2) when extracting conv feature maps. The small objects themselves contain limited information, and the majority of the detailed information will be lost after down-sampling. For example, if the input is a 16 \(\times \) 16 pixel object ROI, the result is a 1 \(\times \) 1 C4 feature map and nothing is preserved for the C5 feature map. These limited conv feature maps degrade the detection performance for such small objects. In contrast, our method up-samples the low-resolution image to a fine scale, thus, recovering the detailed information and making detection possible. Figure 3 shows some super-resolved images generated by our MTGAN generator.

Fig. 3. Examples of super-resolved images generated by our MTGAN network from small low-resolution patches. The first column of each image group depicts the original low-resolution image, up-sampled 4\(\times \) for visualization. The second column is the ground-truth high-resolution image, while the third column is the corresponding super-resolved image generated by our generator network.

Influence of the Regression Branch. As shown in Fig. 1, imperfect localization is one of the main sources of detection error. This is especially the case for small objects, where small shifts in their bounding boxes lead to failed detections under the standard strict evaluation criteria. The regression branch in the discriminator further refines bounding boxes and leads to more accurate localization. From Table 3 (\(1^{st}\) and \(5^{th}\) rows), we see that the AP on the small object subset improves by 0.9% when the regression branch is added, demonstrating its effectiveness in the detection pipeline.

Influence of the Adversarial Loss. Table 3 (the \(2^{nd}\) and \(5^{th}\) rows) shows that the AP on the small subset drops by 0.5% without the adversarial loss. The reason is that images generated without the adversarial loss are overly smooth and lack high-frequency information, which is important for object detection. To encourage the generator to produce high-quality images for better detection, we use the adversarial loss to train our generator network.

Influence of the Classification Loss. From Table 3 (the \(3^{rd}\) and \(5^{th}\) row), we see that the AP performance increases by about 1% on the small subset when the classification loss is incorporated. Clearly, this validates the claim that the classification loss promotes the generator to recover finer detailed information for better classification. In doing so, the discriminator can exploit the fine details to predict the correct category of the ROI images.

Influence of the Regression Loss. As shown in Table 3 (the \(4^{th}\) and \(5^{th}\) row), the AP performance increases by nearly 1% on the small subset by using the regression loss to train the generator network. Similar to the classification loss, the regression loss drives the generator to recover some fine details for better localization. The increased AP demonstrates the necessity of the regression loss in the generator loss function.

Table 3. Performance of our SOD-MTGAN model trained with and without the regression branch, adversarial loss, classification loss, and regression loss on the COCO minival subset. “reg+” indicates the regression branch in the discriminator, “adv” denotes the adversarial loss in Eq. (5), “cls” represents the classification loss in Eq. (6), and “reg” indicates the regression loss in Eq. (7).

4.4 State-of-the-Art Comparison

We compare our proposed method (SOD-MTGAN) with several state-of-the-art object detectors [10, 12, 14, 22, 24, 28, 31] on the COCO \(test-dev\) subset. Table 4 lists the performance of every detector, from which we conclude that our method surpasses all other state-of-the-art methods on all subsets. More importantly, our SOD-MTGAN achieves the highest performance (24.7%) on the small subset, outperforming the second best object detector by about 3%. This AP improvement is most notable for the small object subset, which clearly demonstrates the effectiveness of our method on small object detection.

Table 4. The performance (AP) of the proposed SOD-MTGAN detector and other state-of-the-art methods on the COCO \(test-dev\) subset. “+++” denotes more complex training/testing stages, including multi-scale train/test, horizontal-flip train/test, and OHEM [30], in Faster RCNN.
Fig. 4. Qualitative results of the SOD-MTGAN detector. Green and red boxes denote the ground truths and the results of our method, respectively. Best seen in color and zoomed in. (Color figure online)

4.5 Qualitative Results

Figure 4 shows some detection results generated by the proposed SOD-MTGAN detector. We observe that our method successfully finds almost all the objects, even though some of them are very small, demonstrating the effectiveness of our detector on the small object detection problem. Figure 4 also shows some failure cases, including false negatives and false positives, which indicates that there is still room for further improving small object detection performance.

5 Conclusion

In this paper, we propose an end-to-end multi-task GAN (MTGAN) to detect small objects in unconstrained scenarios. In the MTGAN, the generator up-samples small blurred ROI images to fine-scale clear images, which are passed through the discriminator for classification and bounding-box regression. To recover detailed information for better detection, the classification and regression losses in the discriminator are propagated back to the generator. Extensive experiments on the COCO dataset demonstrate that our detector improves state-of-the-art AP performance in general, with the largest improvement observed for small objects.