Elsevier

Neural Networks

Volume 144, December 2021, Pages 247-259

One-stage CNN detector-based benthonic organisms detection with limited training dataset

https://doi.org/10.1016/j.neunet.2021.08.014

Abstract

This paper addresses two challenges in detecting benthonic organisms, namely their distinctive shape dimensions and the limited training data available, by proposing a one-stage CNN detector-based benthonic organisms detection (OSCD-BOD) scheme. The main contributions are as follows: (1) The regression loss between the predicted bounding box and the ground truth box is measured by the generalized intersection over union (GIoU), which markedly improves the localization accuracy for benthonic organisms. (2) K-means-based dimension clustering derives multiple benthonic organisms anchor boxes (BOAB) that exploit a priori dimension information from the limited training dataset, thereby significantly improving recall. (3) A geometric and color transformations (GCT)-based data augmentation technique is used both to prevent over-fitting during training and to improve detection generalization in complex and changeable underwater environments. (4) The OSCD-BOD scheme is established in a modular manner by integrating the GIoU, BOAB and GCT modules. Comprehensive experiments and comparisons demonstrate that the proposed OSCD-BOD scheme outperforms typical approaches including Faster R-CNN, SSD, YOLOv2, YOLOv3 and CenterNet in terms of mean average precision by 6.88%, 10.92%, 12.44%, 3.05% and 1.09%, respectively.

Introduction

With the increasingly rapid development of machine vision and deep learning for vehicles and underwater robotics (Blas and Blanke, 2011, Elfwing et al., 2018, Su et al., 2020, Tan et al., 2018, Wang and Ahn, 2021, Wang et al., 2021, Wang and He, 2020, Wang and Su, 2019, Wang, Wang et al., 2020), highly intelligent underwater fishing robots are desired to autonomously pick seabed benthonic organisms of interest in marine ranching, e.g., echinus, scallop, starfish and holothurian (ESSH). High-precision detection and recognition modules play a significant role in intelligent fishing robots, since benthonic organisms feature unique shapes and scales and live on diverse seabeds. To some extent, general object detection approaches (Chen et al., 2019, Girshick et al., 2014) are unsuitable for seabed benthonic organisms, since rich semantic information can hardly be extracted in complex and changeable underwater environments. In this context, high-precision benthonic organisms detection (BOD) is challenging from both technical and practical perspectives.

The BOD performance depends critically on feature extraction, for which the main approaches are based on machine learning. However, traditional machine learning-based approaches rely heavily on hand-crafted extractors. Specifically, typical methods including SIFT (Lowe, 1999), SURF (Bay et al., 2006) and HOG (Dalal & Triggs, 2005) have been used to extract features of moving fish. These feature extractors can only extract low-level features, such as color, texture and shape, and therefore suit underwater objects that are easily discriminated from the underwater background. However, benthonic organisms closely resemble the seabed background in terms of color and texture. In this context, it is difficult to extract effective feature information with hand-crafted extractors for detection and recognition of benthonic organisms (Zhao et al., 2019). With the rapid development of hardware resources and computing acceleration, it is appealing to extract high-level BOD semantic features by virtue of the convolutional neural network (CNN) (Bai and Zhang, 2021, Fan et al., 2021, Gridach, 2021, Nápoles et al., 2021). Accordingly, typical CNN-based feature extractors have been established, including AlexNet, ZF-Net, VGG and MobileNet (Howard et al., 2017). More recently, deeper CNN-based feature extractors have been developed, such as GoogLeNet and ResNet (He et al., 2016).

With the aid of the aforementioned CNN-based feature extractors, a variety of two-stage detectors, e.g., R-CNN, SPP-Net, Fast R-CNN, Faster R-CNN and Mask R-CNN (He et al., 2020), have been devised for object detection. In Li et al. (2015), R-CNN and Fast R-CNN architectures have been applied to detect and recognize fish species in a domain-specific underwater environment, where the detection speeds of Fast R-CNN and R-CNN are approximately 0.3 and 0.04 frames per second (FPS), respectively. In Han et al. (2020), a combined approach using the Faster R-CNN architecture and the Hypernet method has been implemented to detect marine organisms, achieving a detection speed of 17 FPS. Similar detection architectures using data augmentation (DA) technology have also been developed to increase detection accuracy (Huang et al., 2019, Xu et al., 2018). However, the foregoing two-stage detectors take region proposal as an intermediate step, which is extremely time-consuming within the BOD task.

Recently, single-stage detectors with faster detection speed have been extensively developed for BOD, such as OverFeat (Sermanet et al., 2013), G-CNN (Najibi et al., 2016), SSD (Liu et al., 2016) and YOLO (Redmon et al., 2016, Redmon and Farhadi, 2017, Redmon and Farhadi, 2018). By virtue of a depthwise separable convolution strategy, a lightweight SSD-based detector (Ma et al., 2019) has been proposed to detect sea cucumbers, achieving 19.8 FPS detection in CPU mode. By deploying a squeeze-and-excitation unit in the feature extraction network, the YOLOv3-UW scheme for detecting dense benthonic organisms has been implemented in Shi et al. (2019), where 46 FPS detection meets real-time requirements. It should be highlighted that, lacking a region proposal network for extracting regions of interest, single-stage detectors incur higher localization errors than two-stage detectors (Lin, Dollár et al., 2017). To improve localization performance for the BOD, apart from optimizing the backbone architecture and improving local feature extraction (He et al., 2020, Lin, Goyal et al., 2017), more effective regression loss functions have been widely explored. Specifically, the smooth L1 function (Liu et al., 2016, Ma et al., 2019, Qiu et al., 2019) has been used to calculate the regression loss between the ground truth box (GTB) and the predicted bounding box (PBB). However, the smooth L1-based loss treats the bounding box localization variables $(b_x, b_y, b_w, b_h)$ as independent of each other, and the same smooth L1 loss value can correspond to completely different intersection over union (IOU) values between the GTB and the PBB (Rezatofighi et al., 2019). In Yu et al. (2016), an IOU-based regression loss has been proposed that takes the correlation among the bounding box variables into consideration. However, the IOU-based regression loss becomes ineffective when the GTB and the PBB do not intersect: the IOU is then zero regardless of how far apart the boxes are, so it cannot measure their similarity, thereby resulting in rather poor localization performance.
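The non-overlap limitation can be illustrated with a minimal sketch. It assumes boxes in corner format (x1, y1, x2, y2) with positive area, which is a convention chosen here and not one stated in the text, and it uses the GIoU definition of Rezatofighi et al. (2019): the IoU minus the fraction of the smallest enclosing box that is not covered by the union. The loss form 1 - GIoU and the sample boxes are illustrative assumptions.

```python
def iou_and_giou(box_a, box_b):
    """Return (IoU, GIoU) for two axis-aligned boxes given as (x1, y1, x2, y2).

    Assumes both boxes have positive area (illustrative convention).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection width/height are clamped to zero when the boxes do not overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    giou = iou - (enclose - union) / enclose
    return iou, giou


# Hypothetical boxes: one ground truth and two non-overlapping predictions.
gtb = (0.0, 0.0, 2.0, 2.0)
near = (3.0, 0.0, 5.0, 2.0)   # close to the ground truth
far = (8.0, 0.0, 10.0, 2.0)   # far from the ground truth

for name, pbb in (("near", near), ("far", far)):
    iou, giou = iou_and_giou(gtb, pbb)
    print(name, "IoU:", iou, "GIoU:", giou, "1 - GIoU:", 1.0 - giou)
```

Both predictions have an IoU of zero, so an IoU-based loss cannot rank them, whereas the GIoU is about -0.2 for the near box and about -0.6 for the far box, so a loss of the form 1 - GIoU still distinguishes them.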

In addition, the shape features of benthonic organisms have been only loosely accommodated by common anchor boxes in previous works (Liu et al., 2020, Ma et al., 2019, Qiu et al., 2019). The width and height scales of benthonic organisms are usually assumed to be the same as those of generic objects, which gives rise to significant inconsistency with the ground truth. In essence, anchor boxes derived from common datasets including PASCAL VOC, MS COCO and ImageNet (Deng et al., 2009, Everingham et al., 2015, Lin et al., 2014, Russakovsky et al., 2015) carry little a priori information about benthonic organisms. Note that combining appropriate a priori shape dimensions with a detection network can significantly increase the recall rate of the BOD (Redmon and Farhadi, 2017, Redmon and Farhadi, 2018, Ren et al., 2015).
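As a rough illustration of how dataset-specific anchor dimensions might be derived, the sketch below clusters box widths and heights with K-means. The distance d = 1 - IoU between co-centred boxes follows the dimension-clustering idea of Redmon and Farhadi (2017) and is an assumption here; the function names, the sample box sizes and the generic 100 x 100 anchor are hypothetical.

```python
import random


def wh_iou(wh, centroid):
    """IoU of two boxes sharing the same centre, given only (width, height)."""
    inter = min(wh[0], centroid[0]) * min(wh[1], centroid[1])
    union = wh[0] * wh[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs with distance 1 - IoU; return k anchors sorted by area."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        # Assignment step: each box joins the centroid with the highest IoU.
        clusters = [[] for _ in range(k)]
        for wh in boxes:
            best = max(range(k), key=lambda j: wh_iou(wh, centroids[j]))
            clusters[best].append(wh)
        # Update step: mean width and height; an empty cluster keeps its centroid.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                new_centroids.append((w, h))
            else:
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids, key=lambda c: c[0] * c[1])


def mean_best_iou(boxes, anchors):
    """Average, over all boxes, of the IoU with the best-matching anchor."""
    return sum(max(wh_iou(b, a) for a in anchors) for b in boxes) / len(boxes)


# Hypothetical ground-truth (width, height) pairs in pixels.
boxes = [(18, 20), (22, 19), (20, 23), (60, 30), (64, 28),
         (58, 33), (30, 70), (28, 66), (33, 72)]
anchors = kmeans_anchors(boxes, k=3)
print("anchors:", anchors)
print("mean IoU, clustered anchors:", mean_best_iou(boxes, anchors))
print("mean IoU, one generic anchor:", mean_best_iou(boxes, [(100.0, 100.0)]))
```

Using 1 - IoU instead of Euclidean distance keeps large boxes from dominating the clustering, since the IoU is scale-relative. K-means can settle in a local optimum, so in practice several seeds may be worth comparing.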

Furthermore, the number of BOD training samples collected by remotely operated vehicles is far smaller than that of generic datasets such as PASCAL VOC, MS COCO and ImageNet (Deng et al., 2009, Everingham et al., 2015, Lin et al., 2014, Russakovsky et al., 2015, Wang, Liu et al., 2020). In this context, on the one hand, a small benthonic organisms training dataset (BOTD) inevitably leads to over-fitting of a CNN with a huge number of weight parameters (Rezatofighi et al., 2019), thereby resulting in poor detection and recognition performance. On the other hand, limited training samples cannot sufficiently represent complex and changeable underwater environments, including color cast, uneven illumination, blurring, low contrast and changing camera views (Ancuti et al., 2012, Ghani and Isa, 2015, Li et al., 2016), such that the various situations of benthonic organisms can hardly be captured effectively.

In this paper, to address the foregoing challenges, a novel one-stage CNN detector-based BOD (OSCD-BOD) scheme for the ESSH is established by virtue of the YOLO philosophy (Redmon et al., 2016, Redmon and Farhadi, 2017, Redmon and Farhadi, 2018). The key points can be summarized as follows:

  • The bounding box regression loss (BBRL) between the GTB and PBB pertaining to the BOD is defined by the generalized intersection over union (GIoU), thereby significantly enhancing both detection precision and recall rate.

  • To accelerate training while significantly increasing the recall rate, especially compared with common anchor boxes (CAB), the benthonic organisms anchor boxes (BOAB) are formulated by devising K-means-based dimension clustering on the BOTD.

  • Using a geometric and color transformations (GCT)-based DA technique, limited marine benthonic organism datasets can be dramatically enlarged into an extended sample set featuring different salt and pepper noise, light intensities, blur and viewpoints, such that over-fitting is effectively avoided and detection sensitivity to complex and changeable underwater environments is significantly reduced.

  • Eventually, the proposed OSCD-BOD scheme is established by integrating the GIoU, BOAB and GCT modules, thereby significantly enhancing the BOD ability in terms of accuracy and robustness and providing an advanced detection solution for the marine biology community.
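To make the GCT idea concrete, the following sketch applies one geometric and two color-type transformations to a tiny grayscale image stored as nested lists. The choice of transformations (horizontal flip for viewpoint, intensity scaling for light intensity, salt and pepper noise), the parameter values and all function names are illustrative assumptions and not the authors' pipeline; box edges are treated as continuous pixel coordinates in (x1, y1, x2, y2) form.

```python
import random


def hflip(image, boxes):
    """Mirror an H x W grayscale image left-right and remap (x1, y1, x2, y2) boxes."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    # A box edge at x moves to width - x, so the left and right edges swap roles.
    new_boxes = [(width - x2, y1, width - x1, y2) for (x1, y1, x2, y2) in boxes]
    return flipped, new_boxes


def scale_intensity(image, factor):
    """Simulate a light-intensity change; results are clipped to [0, 255]."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def salt_and_pepper(image, amount, rng):
    """Set roughly a fraction `amount` of pixels to white (salt) or black (pepper)."""
    noisy = [row[:] for row in image]
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = 255 if rng.random() < 0.5 else 0
    return noisy


# Hypothetical 4 x 6 image with a single labelled box.
image = [[10 * (x + y) for x in range(6)] for y in range(4)]
boxes = [(1, 0, 3, 2)]
rng = random.Random(0)

flipped, flipped_boxes = hflip(image, boxes)
brighter = scale_intensity(image, 1.5)
noisy = salt_and_pepper(image, 0.1, rng)
print("flipped boxes:", flipped_boxes)
```

Geometric transformations must update the box labels together with the pixels, as `hflip` does, while color-type transformations leave the labels unchanged.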

The rest of this paper is organized as follows. The BOD problem is formulated in Section 2. The OSCD-BOD scheme is proposed in Section 3. Experimental studies and comparisons are given in Section 5. Conclusions are drawn in Section 6.

Problem formulation

The PBB (magenta) and GTB (green) information is shown in Fig. 1. The PBB regression center is determined by

$$x_{pc} = \sigma(x_{rc}) + x_{tl}, \qquad y_{pc} = \sigma(y_{rc}) + y_{tl},$$

where $(x_{rc}, y_{rc})$ is the predicted raw center coordinate offset with respect to the top-left corner of the current grid cell, $(x_{tl}, y_{tl})$ denotes the coordinate offset of the current grid cell with respect to the top-left corner of the input image, and $\sigma(\cdot)$ is a logistic activation function that constrains the predicted center offset to the range $(0,1)$.
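The center decoding can be sketched in a few lines. The function names are hypothetical, coordinates are expressed in grid-cell units, and the numerically stable form of the logistic function is an implementation choice made here.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def decode_center(x_rc, y_rc, x_tl, y_tl):
    """Return (x_pc, y_pc) = (sigmoid(x_rc) + x_tl, sigmoid(y_rc) + y_tl)."""
    return sigmoid(x_rc) + x_tl, sigmoid(y_rc) + y_tl


# Hypothetical raw outputs for the grid cell whose top-left corner is (3, 5).
print(decode_center(0.0, 0.0, 3, 5))    # (3.5, 5.5): a zero raw offset maps to the cell centre
print(decode_center(-4.0, 4.0, 3, 5))   # stays inside the same cell
```

Because the logistic output lies between 0 and 1, the decoded center cannot leave the grid cell that predicts it, whatever the raw network output is.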

OSCD-BOD scheme

In this section, the proposed OSCD-BOD scheme is systematically designed by devising four modules, i.e., the CNN-based detection network structure (DNS), the GIoU-based bounding box regression loss, K-means-based dimension clustering and GCT-based data augmentation.

Evaluation indicator

The mean performance under corruption (mPC) and relative performance under corruption (rPC) are employed to evaluate detector performance in this paper. The mPC is defined as

$$\mathrm{mPC} = \frac{1}{N_c}\sum_{c=1}^{N_c}\frac{1}{N_s}\sum_{s=1}^{N_s} P_{c,s},$$

where $P_{c,s}$ is the detection performance on test images with corruption $c$ under severity level $s$, and $N_c$ and $N_s$ refer to the numbers of corruptions and severity levels, respectively.

Accordingly, the rPC is defined by

$$\mathrm{rPC} = \frac{\mathrm{mPC}}{P_{clean}},$$

where $P_{clean}$ represents the detection performance on
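The two indicators can be computed with a short sketch, assuming the scores are arranged as a table `perf[c][s]` with one row per corruption and one column per severity level. The function names and all numbers below are hypothetical and are not results from the paper; `p_clean` stands for the baseline score $P_{clean}$.

```python
def mean_performance_under_corruption(perf):
    """mPC: average over corruptions of the per-corruption mean over severity levels."""
    n_c = len(perf)
    return sum(sum(row) / len(row) for row in perf) / n_c


def relative_performance_under_corruption(perf, p_clean):
    """rPC: mPC divided by the baseline score p_clean."""
    return mean_performance_under_corruption(perf) / p_clean


# Hypothetical scores for N_c = 2 corruptions at N_s = 2 severity levels.
perf = [
    [0.8, 0.6],  # corruption 1, severity levels 1 and 2
    [0.7, 0.5],  # corruption 2, severity levels 1 and 2
]
p_clean = 0.8

print("mPC:", mean_performance_under_corruption(perf))
print("rPC:", relative_performance_under_corruption(perf, p_clean))
```

With these illustrative numbers the mPC is about 0.65 and the rPC about 0.81. An rPC close to 1 would indicate that the detector loses little performance under corruption.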

Experimental studies and comparisons

To demonstrate the effectiveness and superiority of the proposed OSCD-BOD scheme, the SSE-CAB-based DNS conducted on the original dataset (i.e., SSE-CAB-OD-DNS) is implemented together with four other schemes, including (1) the GIoU-CAB-based DNS conducted on the original dataset (i.e., GIoU-CAB-OD-DNS), (2) the SSE-BOAB-based DNS conducted on the original dataset (i.e., SSE-BOAB-OD-DNS), (3) the SSE-CAB-GCT-based DNS conducted on the augmented dataset (i.e., SSE-CAB-GCT-DNS) and (4) the

Conclusion

In this paper, a real-time and accurate detection paradigm for the ESSH in complex and changeable underwater environments has been achieved by developing an OSCD-BOD scheme. Specifically, the GIoU technique has been designed to measure the regression loss between the GTB and PBB. Moreover, the BOAB has been formulated for the first time by virtue of K-means-based dimension clustering, thereby significantly improving anchor box determination.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank the Editor-in-Chief, the Associate Editor, and anonymous reviewers for their invaluable suggestions and comments.

Funding

This work is supported by the Liaoning Revitalization Talents Program (under Grant XLYC1807013), the Equipment Pre-Research Fund of Key Laboratory (under Grant 6142215200106), and the Fundamental Research Funds for the Central Universities (under Grant 3132019344).

References (55)

  • Chen, K., Li, J., Lin, W., See, J., Wang, J., Duan, L., Chen, Z., He, C., & Zou, J. (2019). Towards Accurate One-Stage...
  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of the IEEE...
  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image...
  • Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., & Tian, Q. (2019). CenterNet: Keypoint triplets for object detection. In...
  • Everingham, M., et al. (2015). The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision.
  • Ghani, A. S. A., et al. (2015). Enhancement of low quality underwater image through integrated global and local contrast correction. Applied Soft Computing.
  • Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and...
  • Han, F., et al. (2020). Marine organism detection and classification from underwater vision based on the deep CNN method. Mathematical Problems in Engineering.
  • He, K., et al. (2020). Mask R-CNN. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE...
  • Henderson, P., & Ferrari, V. (2016). End-to-end training of object class detectors for mean average precision. In Asian...
  • Howard, A. G., et al. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications.
  • Kingma, D. P., & Ba, J. L. (2014). Adam: A method for stochastic optimization. In International Conference on Learning...
  • Li, C.-Y., et al. (2016). Underwater image enhancement by dehazing with minimum information loss and histogram distribution prior. IEEE Transactions on Image Processing.
  • Li, X., Shang, M., Qin, H., & Chen, L. (2015). Fast accurate fish detection and recognition of underwater images with...
  • Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object...
  • Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings...