1 Introduction

Nowadays, aquaculture has become one of the most promising avenues for coastal fishermen to breed marine products [13], especially high-quality products on the sea floor such as sea cucumbers, sea urchins and scallops. Underwater operations in traditional aquaculture are mainly carried out by manual labor, which is inefficient and risky. Meanwhile, with the development of artificial intelligence and the decrease in manufacturing costs, a huge demand has emerged for underwater fishing robots, which are low-cost, reliable and affordable platforms for improving the efficiency of harvesting marine products. Although underwater robots such as net cleaning robots have been widely used [13], applying underwater fishing robots remains very challenging because of the difficulty of accurately detecting marine products in a complicated underwater environment.

With the development of Convolutional Neural Networks (CNNs), great improvements have been achieved in object detection on land. Detectors are mainly divided into two categories: two-stage detectors and one-stage detectors. Two-stage detectors adopt a region proposal-based strategy, with pipelines consisting of two stages [3, 5,6,7, 9, 16]: the first stage generates a set of category-independent region proposals, and the second stage classifies them into foreground classes or background. One-stage detectors have no separate proposal step, making the overall pipeline a single stage [8, 10, 12, 14, 15]. Although some methods that do not rely on region proposals have been proposed, region proposal-based methods hold the leading accuracy on benchmark datasets (e.g., PASCAL VOC [4], ILSVRC [19], and Microsoft COCO [11]). Faster R-CNN [16] is one of the most well-known object detection frameworks; it proposed an efficient and accurate Region Proposal Network (RPN) to generate region proposals. Since then, RPN-like proposals have become the standard for two-stage object detectors.

Existing object detectors heavily depend on a large number of accurately annotated images [4, 11, 19]. Annotating such benchmark datasets costs substantial time and labor. To reduce the cost of obtaining accurate annotations, several weakly supervised and semi-supervised object detection frameworks have been proposed in recent years. At present, weakly supervised detection mainly relies on image-level annotations instead of bounding-box annotations [20, 21, 23], while semi-supervised object detectors are trained with few annotated data and massive unannotated data [1, 2, 17]. Nevertheless, the reduction in annotation cost usually comes at the price of degraded model accuracy. Although many promising ideas have been proposed in weakly supervised and semi-supervised object detection, their results are still far from those of strongly supervised methods.

Unlike land images with common object categories, underwater images suffer from image degradation and color distortion due to the absorption and scattering of light in water. Besides, objects in the underwater environment are usually small and tend to cluster. These factors make annotating underwater objects particularly difficult and time-consuming. Therefore, as shown in Fig. 1, partially missing annotations occur frequently in underwater image datasets. Under these circumstances, negative examples are generated not only from the background but also from unannotated foreground, which misguides the training of detectors. As a result, existing strongly and weakly supervised detection algorithms cannot achieve satisfactory results in underwater object detection.

Fig. 1. Example of an underwater image and its corresponding ground truth in URPC2017.

To solve this problem, we propose a proposal-refined weakly supervised object detection method that focuses on training detectors with incompletely annotated datasets. We observe that there are great differences between foreground and background in underwater images. Inspired by this, we design a weakly-fitted segmentation network to segment the foreground and background of an image using only the incompletely annotated detection dataset. Then, we use the segmentation map to control the generation of positive and negative examples when training the detection network, which is conducive to generating high-quality proposals. The proposed method is not restricted to a specific object detection framework; in fact, it can be incorporated into any advanced one. Our experiments are carried out on the Underwater Robot Picking Contest 2017 dataset (URPC2017) and show that the proposed method greatly improves detection accuracy compared to several baseline methods.

2 The Proposed Method

2.1 Overview

In order to reduce the influence of missing annotations in the training images, we design a weakly-fitted segmentation network to separate the foreground from the background, and then use the segmentation results to guide the generation of positive and negative examples during detector training. Figure 2 shows an overview of the proposed architecture. It consists of two stages: the first stage is a weakly-fitted segmentation network and the second stage is a proposal-refined object detection network. The details of each part of our model are introduced in the following sections.
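To make the two-stage flow concrete before the details, here is a minimal training-step sketch in PyTorch (our framework assumption; the paper names none). The names `seg_net`, `det_net` and `training_loss` are illustrative, not an interface from the paper.

```python
import torch

def train_detector_step(image, gt_boxes, seg_net, det_net, beta=0.3):
    """One detector training step. Stage 1 supplies a foreground map;
    stage 2 uses it (together with gt_boxes) when sampling negatives."""
    with torch.no_grad():
        seg_map = seg_net(image)   # stage 1: per-pixel foreground probability
    # The detector's loss internally applies the refined labeling rule of
    # Sec. 2.3 before computing the combined loss of Eq. (2).
    loss = det_net.training_loss(image, gt_boxes, seg_map, beta=beta)
    loss.backward()
    return loss.item()
```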

Fig. 2. The architecture of the proposed object detection network.

2.2 Weakly-Fitted Segmentation

To segment the foreground and background of an underwater image, we adopt the idea of U-Net [18], which consists of a contracting path to capture context information and an expanding path to guarantee localization accuracy. A traditional well-trained U-Net cannot accurately separate the foreground from the background of our underwater images, because the training dataset contains many unannotated foreground areas. To address this problem, we propose two modifications. (1) As shown in Fig. 3, we design a lightweight U-Net with reduced capacity to fit the training dataset. More specifically, we use 7 convolutional layers for downsampling and 6 deconvolutional layers for upsampling, all \(3\times 3\) convolutions with stride 2, without doubling or halving the number of feature channels at each downsampling and upsampling step. This asymmetric design reduces the degree to which the model fits the incompletely annotated training dataset. Afterwards, the image size is restored by bilinear interpolation. (2) To segment as much of the foreground as possible, the network is trained with a modified MSE loss, denoted as

$$\begin{aligned} L(y,y^*) = \frac{1}{N}\sum _{i = 1} ^N(y_i^* -y_i)^2+\lambda \frac{1}{N}\sum _{i = 1} ^Ny_i^*(y_i^* -y_i) \end{aligned}$$
(1)

where \(y\) is the output of the weakly-fitted segmentation network, \(y^*\) is the ground-truth map generated from the bounding-box areas of the underwater object detection dataset, \(i\) is the index of a pixel, and \(N\) is the number of pixels in an image. \(y_i^*\) equals 0 if pixel \(i\) belongs to the background and 1 if it belongs to the foreground. Since \(y_i^*(y_i^* -y_i)\) is nonzero only for foreground pixels, the second term is activated only on the foreground; it enlarges the loss whenever foreground is predicted as background, the mistake encouraged by the confusion of an incompletely annotated dataset. The two terms are normalized by \(N\) and balanced by the weighting parameter \(\lambda \).
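As a concrete reference, below is a minimal PyTorch sketch of Eq. (1); the tensor shapes and the function name are our assumptions.

```python
import torch

def weakly_fitted_loss(y, y_star, lam=2.0):
    """Modified MSE loss of Eq. (1).

    y      -- predicted foreground map, values in [0, 1]
    y_star -- binary target built from bounding boxes (1 = foreground)
    lam    -- balancing weight lambda (Sec. 3.2 uses lambda = 2)
    """
    mse = ((y_star - y) ** 2).mean()
    # y_star * (y_star - y) is nonzero only on foreground pixels, so this
    # term further penalizes predicting annotated foreground as background.
    fg_term = (y_star * (y_star - y)).mean()
    return mse + lam * fg_term
```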

Fig. 3. The architecture of the weakly-fitted segmentation network.
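The sketch below instantiates the architecture of Fig. 3 as we read it; the channel width (32) and the absence of skip connections are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeaklyFittedUNet(nn.Module):
    """Lightweight encoder-decoder per Sec. 2.2: 7 stride-2 3x3 convolutions
    down, 6 stride-2 3x3 deconvolutions up, a constant channel width, and a
    final bilinear resize back to the input resolution."""

    def __init__(self, ch=32):
        super().__init__()
        self.down = nn.ModuleList(
            [nn.Conv2d(3 if i == 0 else ch, ch, 3, stride=2, padding=1)
             for i in range(7)])
        self.up = nn.ModuleList(
            [nn.ConvTranspose2d(ch, ch, 3, stride=2, padding=1,
                                output_padding=1)
             for _ in range(6)])
        self.head = nn.Conv2d(ch, 1, 1)  # per-pixel foreground score

    def forward(self, x):
        h, w = x.shape[-2:]
        for conv in self.down:
            x = F.relu(conv(x))
        for deconv in self.up:
            x = F.relu(deconv(x))
        x = torch.sigmoid(self.head(x))
        # One fewer upsampling step than downsampling steps: restore the
        # original size by bilinear interpolation, as described above.
        return F.interpolate(x, size=(h, w), mode='bilinear',
                             align_corners=False)
```

With a roughly constant width and only seven stride-2 convolutions, this network has far less capacity than a standard U-Net, which is exactly the weak fitting the method relies on.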

2.3 Proposal-Refined Object Detection

The quality of the proposals has a great influence on the performance of object detection; therefore, various studies focus on region proposal generation [22, 24]. Among them, the Region Proposal Network (RPN) proposed in Faster R-CNN [16] is the most influential method of recent years. Accordingly, we build our strategy on the Faster R-CNN framework in this paper.

The architecture of Faster R-CNN can be divided into two parts: the Region Proposal Network (RPN) and the region-of-interest (RoI) classifier. When training the RPN, the traditional method assigns a negative label to an anchor if its Intersection-over-Union (IoU) ratio is lower than 0.3 for all ground-truth boxes. However, as shown in Fig. 4, an incompletely annotated dataset produces many false negative examples that actually contain unlabeled objects, which directly harms the learning of the RPN. To address this problem, we feed the segmentation map generated in the first stage to both the RPN and the RoI classifier. When the RPN or the RoI classifier assigns a negative label to an anchor, it refers not only to the IoU with the ground truth but also to the segmentation map.

Fig. 4. An illustration of true and false negative examples.

The specific steps are as follows, with a sketch given after this paragraph. (1) First, the foreground of an underwater image is obtained by the weakly-fitted segmentation network and denoted as \(S_1\); we then subtract the ground-truth boxes from \(S_1\) to obtain the unlabeled foreground region \(S_2\). (2) When training the RPN, positives are labeled in the same way as in the traditional strategy, but assigning a negative label to an anchor requires two conditions: (i) its IoU ratio is lower than 0.3 for all ground-truth boxes, and (ii) its IoU ratio is lower than or equal to \(\beta \) for \(S_2\). Similarly, the generation of positive and negative examples is constrained by both the ground truth and the segmentation map during the training of the RoI classifier.
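Below is a minimal sketch of the refined labeling rule, assuming \(S_2\) is given as a binary pixel mask. Since the paper does not detail how the IoU with a pixel region is computed, we approximate it here as the fraction of the anchor covered by \(S_2\); this approximation is our assumption.

```python
import torch

def unlabeled_overlap(anchors, s2_mask):
    """Fraction of each anchor covered by the unlabeled-foreground mask S2.

    anchors: (A, 4) boxes (x1, y1, x2, y2) in pixels; s2_mask: (H, W) in {0, 1}.
    """
    overlaps = torch.zeros(len(anchors))
    for i, (x1, y1, x2, y2) in enumerate(anchors.long().tolist()):
        inter = s2_mask[y1:y2, x1:x2].sum().float()
        area = float((x2 - x1) * (y2 - y1))
        overlaps[i] = inter / max(area, 1.0)
    return overlaps

def negative_mask(anchors, max_iou_with_gt, s2_mask, beta=0.3):
    """An anchor may be labeled negative only if it satisfies both:
    (i) IoU < 0.3 with every ground-truth box, and
    (ii) overlap <= beta with the unlabeled foreground region S2."""
    return (max_iou_with_gt < 0.3) & (unlabeled_overlap(anchors, s2_mask) <= beta)
```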

By controlling the generation of negative examples in this way, we eliminate false negative examples and thus provide more accurate positive and negative examples for training the object detection network, leading to high-quality proposals. Following [16], classification loss and bounding-box regression loss are computed for both the RPN and the RoI classifier:

$$\begin{aligned} L_{total} = L_{cls}^{rpn}+c^*L_{reg}^{rpn}+L_{cls}^{roi}+p^*L_{reg}^{roi} \end{aligned}$$
(2)

where \(L_{cls}\) is the cross-entropy loss for classification, \(L_{reg}\) is the smooth L1 loss defined in [5] for regression, and the indicators \(c^*\) and \(p^*\) mean that the regression losses are activated only for positive anchors and for non-background proposals, respectively. It is worth mentioning that although the proposed method is implemented on Faster R-CNN, it is applicable to other region proposal-based methods such as R-FCN [3], FPN [9] and Mask R-CNN [7].
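For reference, a sketch of Eq. (2) in PyTorch; the argument names and the label conventions (label -1 = ignored anchor, class 0 = background) are standard Faster R-CNN assumptions rather than details from the paper.

```python
import torch.nn.functional as F

def detector_loss(rpn_cls_logits, rpn_labels, rpn_reg, rpn_targets,
                  roi_cls_logits, roi_labels, roi_reg, roi_targets):
    """Eq. (2): classification plus masked regression losses."""
    l_cls_rpn = F.cross_entropy(rpn_cls_logits, rpn_labels, ignore_index=-1)
    pos = rpn_labels == 1          # c*: regression only for positive anchors
    l_reg_rpn = F.smooth_l1_loss(rpn_reg[pos], rpn_targets[pos])
    l_cls_roi = F.cross_entropy(roi_cls_logits, roi_labels)
    fg = roi_labels > 0            # p*: regression only for non-background
    l_reg_roi = F.smooth_l1_loss(roi_reg[fg], roi_targets[fg])
    return l_cls_rpn + l_reg_rpn + l_cls_roi + l_reg_roi
```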

3 Experiment

3.1 Dataset and Metric

Our experiments are carried out on the Underwater Robot Picking Contest 2017 dataset (URPC2017), which contains 3 object categories (sea cucumber, sea urchin and scallop) in a total of 19967 underwater images. The dataset is divided into train, val and test sets with 17655, 1317 and 985 images, respectively. In the dataset, completely annotated images are fewer than incompletely annotated ones. We train our segmentation and detection networks on the trainval set, which contains both completely and incompletely annotated images; the test set consists of accurately and completely annotated images. The dataset used to train the weakly-fitted segmentation network is generated from the bounding-box areas (see Fig. 5 and the sketch after its caption). Object detection accuracy is measured by mean Average Precision (mAP).

Fig. 5. The generation of the segmentation dataset.
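A minimal sketch of this conversion, assuming boxes in pixel coordinates: every pixel inside an annotated box becomes foreground, everything else background.

```python
import numpy as np

def boxes_to_mask(boxes, h, w):
    """Rasterize bounding boxes into a binary segmentation target (Fig. 5).

    boxes: iterable of (x1, y1, x2, y2) in pixel coordinates.
    """
    mask = np.zeros((h, w), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        mask[int(y1):int(y2), int(x1):int(x2)] = 1.0
    return mask
```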

3.2 Implementation Details

To train the weakly-fitted segmentation network, we use a learning rate of 0.0001 for 70k iterations and set \(\lambda \) = 2, which makes the two terms in Eq. 1 roughly equally weighted after normalization. To train the proposal-refined object detection network, we use Faster R-CNN as our baseline detection framework. VGG16 pre-trained on ImageNet is used as the backbone for feature extraction because of the small scale of the dataset. The initial learning rate is set to 0.0002 for the first 50k iterations and then decreased to 0.00002 for the following 20k iterations. The momentum and weight decay are set to 0.9 and 0.0005, respectively. Other hyper-parameters are identical to those defined in [16].
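In PyTorch terms (our framework assumption), the detector's optimization settings could look like the sketch below; the paper does not state the optimizer class, so SGD with momentum, the standard choice for Faster R-CNN, is assumed.

```python
import torch

def build_detector_optimizer(model):
    """lr 2e-4 for 50k iterations, then 2e-5 for 20k more; momentum 0.9,
    weight decay 5e-4 (Sec. 3.2). Call sched.step() once per iteration.
    The segmentation network is trained separately at lr 1e-4 for 70k."""
    opt = torch.optim.SGD(model.parameters(), lr=2e-4,
                          momentum=0.9, weight_decay=5e-4)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[50000],
                                                 gamma=0.1)
    return opt, sched
```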

3.3 Experimental Results

The Influence of the IoU Threshold. We explore the influence of the IoU threshold \(\beta \) of the segmentation map on the detector. \(\beta \) = 1 corresponds to the baseline result of the original Faster R-CNN, whose negative example generation is not constrained by the segmentation map. As shown in Table 1, \(\beta \) = 0.3 outperforms the other choices and is \(12.1\%\) better than the baseline. This indicates that allowing negative examples to contain parts of objects is beneficial to detection performance. When \(\beta \) = 0, the detector is trained on a large number of easily classified background examples, which does not help improve detection accuracy. Consequently, we choose \(\beta \) = 0.3 for the following experiments.

Table 1. Comparison of results with different IoU thresholds of the segmentation map.

The Results of Weakly-Fitted Segmentation. Figure 6 shows qualitative segmentation results: (a) is the input image, (b) is the segmentation result of U-Net, and (c) is the result of the proposed weakly-fitted segmentation network. Under the same experimental setting, U-Net clearly cannot completely separate the foreground from the background, because the unannotated foreground areas impair its ability to distinguish between the two. In contrast, the proposed weakly-fitted segmentation network can segment the foreground and background of an underwater image, including the regions left unannotated in the underwater object detection dataset, because it reduces the degree to which the model fits the training data and increases the penalty for classifying foreground as background.

Fig. 6. Qualitative segmentation results on the URPC2017 dataset.

The Results of Proposal-Refined Object Detection. To show how Faster R-CNN and the proposal-refined detector improve during learning, we plot the mAP of the two detectors over training iterations. As shown in Fig. 7, both detectors improve in the early stage, but the proposal-refined detector always has a higher mAP than Faster R-CNN, demonstrating the effectiveness of the proposal-refined object detection network. Figure 8 compares qualitative results of the baseline Faster R-CNN (top) with the proposal-refined detector (bottom). The proposal-refined detector detects more objects than the baseline framework, especially small and challenging ones.

Fig. 7. The change in mAP for Faster R-CNN and the proposal-refined detector on the URPC2017 dataset during training.

Fig. 8. Qualitative detection results on URPC2017. Top: results of the Faster R-CNN baseline model. Bottom: results of the proposal-refined detector.

Comparisons with the State of the Art. In this section, we apply the proposed method to other outstanding object detection networks: R-FCN [3], FPN [9] and Mask R-CNN [7]. As shown in Table 2, our method improves the mAP of the original object detectors by about 10%, indicating the effectiveness and robustness of the proposal-refined weakly supervised object detection. By eliminating false negative examples, the proposed method counteracts the accuracy decrease caused by incompletely annotated datasets.

Table 2. Comparison of results for different methods.

4 Conclusion

In this paper, we propose a simple but effective framework for object detection in underwater images with incompletely annotated datasets. Our proposal-refined weakly supervised object detection system is composed of two stages: the first stage is a weakly-fitted segmentation network that separates foreground from background, and the second stage is a proposal-refined object detector that uses the segmentation map to generate high-quality proposals. Experiments show that the proposed method greatly improves detection performance compared to several baseline methods. With our method, we can not only reduce the cost of dataset annotation but also offset the accuracy decrease caused by missing annotations. In addition, the idea of the proposed method applies not only to underwater object detection but also to other detection tasks with incomplete annotation.