1 Introduction

Nowadays, aquaculture has become one of the most promising avenues for coastal fishermen to breed marine products [13], especially high-quality products on the sea floor such as sea cucumbers, sea urchins and scallops. Underwater operations in traditional aquaculture are mainly carried out by manual labor, which is inefficient and risky. Meanwhile, with the development of artificial intelligence and the decrease in manufacturing costs, a huge demand has emerged for underwater fishing robots, which are low-cost, reliable and affordable platforms for improving the efficiency of harvesting marine products. Although underwater robots such as net cleaning robots have been widely used [13], applying underwater fishing robots remains very challenging because of the difficulty of accurately detecting marine products in a complicated underwater environment.

With the development of Convolutional Neural Networks (CNNs), great improvements have been achieved in object detection on land. Detectors are mainly divided into two categories: two-stage detectors and one-stage detectors. Two-stage detectors adopt a region proposal-based strategy, with pipelines consisting of two stages [3, 5,6,7, 9, 16]: the first stage generates a set of category-independent region proposals, and the second stage classifies them into foreground classes or background. One-stage detectors have no separate proposal step, making the overall pipeline a single stage [8, 10, 12, 14, 15]. Although some methods that do not rely on region proposals have been proposed, region proposal-based methods hold the leading accuracy on benchmark datasets (e.g., PASCAL VOC [4], ILSVRC [19], and Microsoft COCO [11]). Faster R-CNN [16] is one of the most well-known object detection frameworks; it proposed an efficient and accurate Region Proposal Network (RPN) to generate region proposals. Since then, RPN-like proposals have become the standard for two-stage object detectors.

Existing object detectors heavily depend on a large number of accurately annotated images [4, 11, 19]. Annotating such benchmark datasets costs substantial time and labor. To reduce the cost of obtaining accurate annotations, several weakly supervised and semi-supervised object detection frameworks have been proposed in recent years. At present, weakly supervised detection mainly relies on image-level annotations instead of bounding-box annotations [20, 21, 23], while semi-supervised object detectors are trained with few annotated data and massive unannotated data [1, 2, 17]. Nevertheless, the reduction in annotation cost usually comes at the price of degraded model accuracy. Although many promising ideas have been proposed in weakly supervised and semi-supervised object detection, their results are still far from those of strongly supervised methods.

Unlike land images with common object categories, underwater images suffer from image degradation and color distortion due to the absorption and scattering of light in water. Besides, objects in the underwater environment are usually small and tend to cluster. These factors make annotating underwater objects particularly difficult and time-consuming. Therefore, as shown in Fig. 1, partially missing annotations occur frequently in underwater image datasets. Under these circumstances, negative examples are generated not only from the background but also from unannotated foreground, which misguides the training of detectors. As a result, existing strongly and weakly supervised detection algorithms cannot achieve satisfactory results in underwater object detection.

Fig. 1. Example of an underwater image and its corresponding ground truth in URPC2017.

To solve this problem, we propose a proposal-refined weakly supervised object detection method that focuses on training detectors with incompletely annotated datasets. We observe that there are great differences between foreground and background in underwater images. Inspired by this, we design a weakly-fitted segmentation network to segment the foreground and background of an image using only the incompletely annotated detection dataset. Then, we use the segmentation map to control the generation of positive and negative examples when training the detection network, which is conducive to generating high-quality proposals. The proposed method is not restricted to a specific object detection framework; in fact, it can be incorporated into any advanced one. Our experiments are carried out on the Underwater Robot Picking Contest 2017 dataset (URPC2017) and show that the proposed method greatly improves detection accuracy compared to several baseline methods.

2 The Proposed Method

2.1 Overview

In order to reduce the influence of missing annotations in the training images, we design a weakly-fitted segmentation network to separate the foreground from the background, and then use the segmentation results to guide the generation of positive and negative examples during detector training. Figure 2 shows an overview of the proposed architecture. It consists of two stages: the first stage is a weakly-fitted segmentation network and the second stage is a proposal-refined object detection network. The details of each part of our model are introduced in the following sections.
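To make the two-stage flow concrete before the details, here is a minimal training-step sketch in PyTorch (our framework assumption; the paper names none). The names `seg_net`, `det_net` and `training_loss` are illustrative, not an interface from the paper.

```python
import torch

def train_detector_step(image, gt_boxes, seg_net, det_net, beta=0.3):
    """One detector training step. Stage 1 supplies a foreground map;
    stage 2 uses it (together with gt_boxes) when sampling negatives."""
    with torch.no_grad():
        seg_map = seg_net(image)   # stage 1: per-pixel foreground probability
    # The detector's loss internally applies the refined labeling rule of
    # Sec. 2.3 before computing the combined loss of Eq. (2).
    loss = det_net.training_loss(image, gt_boxes, seg_map, beta=beta)
    loss.backward()
    return loss.item()
```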

Fig. 2. The architecture of the proposed object detection network.

2.2 Weakly-Fitted Segmentation

To segment the foreground and background of an underwater image, we adopt the idea of U-Net [18], which consists of a contracting path to capture context information and an expanding path to guarantee localization accuracy. A traditional well-trained U-Net cannot accurately separate the foreground from the background of our underwater images, because the training dataset contains many unannotated foreground areas. To address this problem, we propose two modifications. (1) As shown in Fig. 3, we design a lightweight U-Net with reduced capacity to fit the training dataset. More specifically, we use 7 convolutional layers for downsampling and 6 deconvolutional layers for upsampling, all \(3\times 3\) convolutions with stride 2, without doubling or halving the number of feature channels at each downsampling and upsampling step. This asymmetric design reduces the degree to which the model fits the incompletely annotated training dataset. Afterwards, the image size is restored by bilinear interpolation. (2) To segment as much of the foreground as possible, the network is trained with a modified MSE loss, denoted as

$$\begin{aligned} L(y,y^*) = \frac{1}{N}\sum _{i = 1} ^N(y_i^* -y_i)^2+\lambda \frac{1}{N}\sum _{i = 1} ^Ny_i^*(y_i^* -y_i) \end{aligned}$$
(1)

where \(y\) is the output of the weakly-fitted segmentation network, \(y^*\) is the ground-truth map generated from the bounding-box areas of the underwater object detection dataset, \(i\) is the index of a pixel, and \(N\) is the number of pixels in an image. \(y_i^*\) equals 0 if pixel \(i\) belongs to the background and 1 if it belongs to the foreground. Since \(y_i^*(y_i^* -y_i)\) is nonzero only for foreground pixels, the second term is activated only on the foreground; it enlarges the loss whenever foreground is predicted as background, the mistake encouraged by the confusion of an incompletely annotated dataset. The two terms are normalized by \(N\) and balanced by the weighting parameter \(\lambda \).
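As a concrete reference, below is a minimal PyTorch sketch of Eq. (1); the tensor shapes and the function name are our assumptions.

```python
import torch

def weakly_fitted_loss(y, y_star, lam=2.0):
    """Modified MSE loss of Eq. (1).

    y      -- predicted foreground map, values in [0, 1]
    y_star -- binary target built from bounding boxes (1 = foreground)
    lam    -- balancing weight lambda (Sec. 3.2 uses lambda = 2)
    """
    mse = ((y_star - y) ** 2).mean()
    # y_star * (y_star - y) is nonzero only on foreground pixels, so this
    # term further penalizes predicting annotated foreground as background.
    fg_term = (y_star * (y_star - y)).mean()
    return mse + lam * fg_term
```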

Fig. 3. The architecture of the weakly-fitted segmentation network.
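The sketch below instantiates the architecture of Fig. 3 as we read it; the channel width (32) and the absence of skip connections are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeaklyFittedUNet(nn.Module):
    """Lightweight encoder-decoder per Sec. 2.2: 7 stride-2 3x3 convolutions
    down, 6 stride-2 3x3 deconvolutions up, a constant channel width, and a
    final bilinear resize back to the input resolution."""

    def __init__(self, ch=32):
        super().__init__()
        self.down = nn.ModuleList(
            [nn.Conv2d(3 if i == 0 else ch, ch, 3, stride=2, padding=1)
             for i in range(7)])
        self.up = nn.ModuleList(
            [nn.ConvTranspose2d(ch, ch, 3, stride=2, padding=1,
                                output_padding=1)
             for _ in range(6)])
        self.head = nn.Conv2d(ch, 1, 1)  # per-pixel foreground score

    def forward(self, x):
        h, w = x.shape[-2:]
        for conv in self.down:
            x = F.relu(conv(x))
        for deconv in self.up:
            x = F.relu(deconv(x))
        x = torch.sigmoid(self.head(x))
        # One fewer upsampling step than downsampling steps: restore the
        # original size by bilinear interpolation, as described above.
        return F.interpolate(x, size=(h, w), mode='bilinear',
                             align_corners=False)
```

With a roughly constant width and only seven stride-2 convolutions, this network has far less capacity than a standard U-Net, which is exactly the weak fitting the method relies on.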

2.3 Proposal-Refined Object Detection

The quality of the proposals has a great influence on the performance of object detection; therefore, various studies focus on region proposal generation [22, 24]. Among them, the Region Proposal Network (RPN) proposed in Faster R-CNN [16] is the most influential method of recent years. Accordingly, we build our strategy on the Faster R-CNN framework in this paper.

The architecture of Faster R-CNN can be divided into two parts: the Region Proposal Network (RPN) and the region-of-interest (RoI) classifier. When training the RPN, the traditional method assigns a negative label to an anchor if its Intersection-over-Union (IoU) ratio is lower than 0.3 for all ground-truth boxes. However, as shown in Fig. 4, an incompletely annotated dataset produces many false negative examples that actually contain unlabeled objects, which directly harms the learning of the RPN. To address this problem, we feed the segmentation map generated in the first stage to both the RPN and the RoI classifier. When the RPN or the RoI classifier assigns a negative label to an anchor, it refers not only to the IoU with the ground truth but also to the segmentation map.

Fig. 4. An illustration of true and false negative examples.

The specific steps are as follows, with a sketch given after this paragraph. (1) First, the foreground of an underwater image is obtained by the weakly-fitted segmentation network and denoted as \(S_1\); we then subtract the ground-truth boxes from \(S_1\) to obtain the unlabeled foreground region \(S_2\). (2) When training the RPN, positives are labeled in the same way as in the traditional strategy, but assigning a negative label to an anchor requires two conditions: (i) its IoU ratio is lower than 0.3 for all ground-truth boxes, and (ii) its IoU ratio is lower than or equal to \(\beta \) for \(S_2\). Similarly, the generation of positive and negative examples is constrained by both the ground truth and the segmentation map during the training of the RoI classifier.
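Below is a minimal sketch of the refined labeling rule, assuming \(S_2\) is given as a binary pixel mask. Since the paper does not detail how the IoU with a pixel region is computed, we approximate it here as the fraction of the anchor covered by \(S_2\); this approximation is our assumption.

```python
import torch

def unlabeled_overlap(anchors, s2_mask):
    """Fraction of each anchor covered by the unlabeled-foreground mask S2.

    anchors: (A, 4) boxes (x1, y1, x2, y2) in pixels; s2_mask: (H, W) in {0, 1}.
    """
    overlaps = torch.zeros(len(anchors))
    for i, (x1, y1, x2, y2) in enumerate(anchors.long().tolist()):
        inter = s2_mask[y1:y2, x1:x2].sum().float()
        area = float((x2 - x1) * (y2 - y1))
        overlaps[i] = inter / max(area, 1.0)
    return overlaps

def negative_mask(anchors, max_iou_with_gt, s2_mask, beta=0.3):
    """An anchor may be labeled negative only if it satisfies both:
    (i) IoU < 0.3 with every ground-truth box, and
    (ii) overlap <= beta with the unlabeled foreground region S2."""
    return (max_iou_with_gt < 0.3) & (unlabeled_overlap(anchors, s2_mask) <= beta)
```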

By controlling the generation of negative examples in this way, we eliminate false negative examples and thus provide more accurate positive and negative examples for training the object detection network, leading to high-quality proposals. Following [16], classification loss and bounding-box regression loss are computed for both the RPN and the RoI classifier:

$$\begin{aligned} L_{total} = L_{cls}^{rpn}+c^*L_{reg}^{rpn}+L_{cls}^{roi}+p^*L_{reg}^{roi} \end{aligned}$$
(2)

where \(L_{cls}\) is the cross-entropy loss for classification, \(L_{reg}\) is the smooth L1 loss defined in [5] for regression, and the indicators \(c^*\) and \(p^*\) mean that the regression losses are activated only for positive anchors and for non-background proposals, respectively. It is worth mentioning that although the proposed method is implemented on Faster R-CNN, it is applicable to other region proposal-based methods such as R-FCN [3], FPN [9] and Mask R-CNN [7].
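For reference, a sketch of Eq. (2) in PyTorch; the argument names and the label conventions (label -1 = ignored anchor, class 0 = background) are standard Faster R-CNN assumptions rather than details from the paper.

```python
import torch.nn.functional as F

def detector_loss(rpn_cls_logits, rpn_labels, rpn_reg, rpn_targets,
                  roi_cls_logits, roi_labels, roi_reg, roi_targets):
    """Eq. (2): classification plus masked regression losses."""
    l_cls_rpn = F.cross_entropy(rpn_cls_logits, rpn_labels, ignore_index=-1)
    pos = rpn_labels == 1          # c*: regression only for positive anchors
    l_reg_rpn = F.smooth_l1_loss(rpn_reg[pos], rpn_targets[pos])
    l_cls_roi = F.cross_entropy(roi_cls_logits, roi_labels)
    fg = roi_labels > 0            # p*: regression only for non-background
    l_reg_roi = F.smooth_l1_loss(roi_reg[fg], roi_targets[fg])
    return l_cls_rpn + l_reg_rpn + l_cls_roi + l_reg_roi
```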

3 Experiment

3.1 Dataset and Metric

Our experiments are carried out on the Underwater Robot Picking Contest 2017 dataset (URPC2017), which contains 3 object categories (sea cucumber, sea urchin and scallop) in a total of 19967 underwater images. The dataset is divided into train, val and test sets with 17655, 1317 and 985 images, respectively. In the dataset, completely annotated images are fewer than incompletely annotated ones. We train our segmentation and detection networks on the trainval set, which contains both completely and incompletely annotated images; the test set consists of accurately and completely annotated images. The dataset used to train the weakly-fitted segmentation network is generated from the bounding-box areas (see Fig. 5 and the sketch after its caption). Object detection accuracy is measured by mean Average Precision (mAP).

Fig. 5. The generation of the segmentation dataset.
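A minimal sketch of this conversion, assuming boxes in pixel coordinates: every pixel inside an annotated box becomes foreground, everything else background.

```python
import numpy as np

def boxes_to_mask(boxes, h, w):
    """Rasterize bounding boxes into a binary segmentation target (Fig. 5).

    boxes: iterable of (x1, y1, x2, y2) in pixel coordinates.
    """
    mask = np.zeros((h, w), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        mask[int(y1):int(y2), int(x1):int(x2)] = 1.0
    return mask
```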

3.2 Implementation Details

To train the weakly-fitted segmentation network, we use a learning rate of 0.0001 for 70k iterations and set \(\lambda \) = 2, which makes the two terms in Eq. 1 roughly equally weighted after normalization. To train the proposal-refined object detection network, we use Faster R-CNN as our baseline detection framework. VGG16 pre-trained on ImageNet is used as the backbone for feature extraction because of the small scale of the dataset. The initial learning rate is set to 0.0002 for the first 50k iterations and then decreased to 0.00002 for the following 20k iterations. The momentum and weight decay are set to 0.9 and 0.0005, respectively. Other hyper-parameters are identical to those defined in [16].
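In PyTorch terms (our framework assumption), the detector's optimization settings could look like the sketch below; the paper does not state the optimizer class, so SGD with momentum, the standard choice for Faster R-CNN, is assumed.

```python
import torch

def build_detector_optimizer(model):
    """lr 2e-4 for 50k iterations, then 2e-5 for 20k more; momentum 0.9,
    weight decay 5e-4 (Sec. 3.2). Call sched.step() once per iteration.
    The segmentation network is trained separately at lr 1e-4 for 70k."""
    opt = torch.optim.SGD(model.parameters(), lr=2e-4,
                          momentum=0.9, weight_decay=5e-4)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[50000],
                                                 gamma=0.1)
    return opt, sched
```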

3.3 Experimental Results

The Influence of the IoU Threshold. We explore the influence of the IoU threshold \(\beta \) of the segmentation map on the detector. \(\beta \) = 1 corresponds to the baseline result of the original Faster R-CNN, whose negative example generation is not constrained by the segmentation map. As shown in Table 1, \(\beta \) = 0.3 outperforms the other choices and is \(12.1\%\) better than the baseline. This indicates that allowing negative examples to contain parts of objects is beneficial to detection performance. When \(\beta \) = 0, the detector is trained on a large number of easily classified background examples, which does not help improve detection accuracy. Consequently, we choose \(\beta \) = 0.3 for the following experiments.

Table 1. Comparison of results with different IoU thresholds of the segmentation map.

The Results of Weakly-Fitted Segmentation. Figure 6 shows qualitative segmentation results: (a) is the input image, (b) is the segmentation result of U-Net, and (c) is the result of the proposed weakly-fitted segmentation network. Under the same experimental setting, U-Net clearly cannot completely separate the foreground from the background, because the unannotated foreground areas impair its ability to distinguish between the two. In contrast, the proposed weakly-fitted segmentation network can segment the foreground and background of an underwater image, including the regions left unannotated in the underwater object detection dataset, because it reduces the degree to which the model fits the training data and increases the penalty for classifying foreground as background.

Fig. 6. Qualitative segmentation results on the URPC2017 dataset.

The Results of Proposal-Refined Object Detection. To show how Faster R-CNN and the proposal-refined detector improve during learning, we plot the mAP of the two detectors over training iterations. As shown in Fig. 7, both detectors improve in the early stage, but the proposal-refined detector always has a higher mAP than Faster R-CNN, demonstrating the effectiveness of the proposal-refined object detection network. Figure 8 compares qualitative results of the baseline Faster R-CNN (top) with the proposal-refined detector (bottom). The proposal-refined detector detects more objects than the baseline framework, especially small and challenging ones.

Fig. 7. The change in mAP for Faster R-CNN and the proposal-refined detector on the URPC2017 dataset during training.

Fig. 8. Qualitative detection results on URPC2017. Top: results of the Faster R-CNN baseline model. Bottom: results of the proposal-refined detector.

Comparisons with the State of the Art. In this section, we apply the proposed method to other outstanding object detection networks: R-FCN [3], FPN [9] and Mask R-CNN [7]. As shown in Table 2, our method improves the mAP of the original object detectors by about 10%, indicating the effectiveness and robustness of the proposal-refined weakly supervised object detection. By eliminating false negative examples, the proposed method counteracts the accuracy decrease caused by incompletely annotated datasets.

Table 2. Comparison of results for different methods.

4 Conclusion

In this paper, we propose a simple but effective framework for object detection in underwater images with incompletely annotated datasets. Our proposal-refined weakly supervised object detection system is composed of two stages: the first stage is a weakly-fitted segmentation network that separates foreground from background, and the second stage is a proposal-refined object detector that uses the segmentation map to generate high-quality proposals. Experiments show that the proposed method greatly improves detection performance compared to several baseline methods. With our method, we can not only reduce the cost of dataset annotation but also offset the accuracy decrease caused by missing annotations. In addition, the idea of the proposed method applies not only to underwater object detection but also to other detection tasks with incomplete annotation.