One-stage CNN detector-based benthonic organisms detection with limited training dataset
Introduction
With the increasingly rapid development of machine vision and deep learning associated with vehicles and underwater robotics (Blas and Blanke, 2011, Elfwing et al., 2018, Su et al., 2020, Tan et al., 2018, Wang and Ahn, 2021, Wang et al., 2021, Wang and He, 2020, Wang and Su, 2019, Wang, Wang et al., 2020), high-intelligence underwater fishing robots are desired to autonomously pick seabed benthonic organisms of interest in marine ranching, e.g., echinus, scallop, starfish and holothurian (ESSH). High-precision detection and recognition modules play a significant role in intelligent fishing robots, since benthonic organisms feature unique shapes and scales and live on diversified seabeds. To some extent, general object detection approaches (Chen et al., 2019, Girshick et al., 2014) become unsuitable for seabed benthonic organisms, since rich semantic information can hardly be extracted under changeable and complicated underwater environments. In this context, high-precision benthonic organisms detection (BOD) becomes rather challenging from both technical and practical sides.
The BOD performance critically depends on feature extraction, for which the main approaches are associated with machine learning. However, traditional machine learning-based approaches heavily rely on hand-crafted extractors. To be specific, typical methods including SIFT (Lowe, 1999), SURF (Bay et al., 2006) and HOG (Dalal & Triggs, 2005) have been developed to extract moving fish features. It should be noted that the aforementioned feature extractors can only extract low-level features, such as color, texture and shape, for underwater objects that are easily discriminated from the underwater background. However, benthonic organisms closely resemble the seabed background in terms of color and texture. In this context, it becomes rather difficult to extract effective feature information with hand-crafted extractors for detection and recognition of benthonic organisms (Zhao et al., 2019). With the rapid development of hardware resources and computing acceleration, it is appealing to extract high-level BOD semantic features by virtue of the convolutional neural network (CNN) (Bai and Zhang, 2021, Fan et al., 2021, Gridach, 2021, Nápoles et al., 2021). Accordingly, typical CNN-based feature extractors have been established by AlexNet, ZF-Net, VGG and MobileNet (Howard et al., 2017), etc. Recently, deeper advances in CNN-based feature extractors have been made by GoogLeNet and ResNet (He et al., 2016).
With the aid of the aforementioned CNN-based feature extractors, fruitful two-stage detectors, e.g., R-CNN, SPP-Net, Fast R-CNN, Faster R-CNN and Mask R-CNN (He et al., 2020), have been devised for object detection. In Li et al. (2015), R-CNN and Fast R-CNN architectures have been proposed for detecting and recognizing fish species under domain-specific underwater environments, whereby the detection speed of Fast R-CNN and R-CNN is approximately 0.3 and 0.04 frames per second (FPS), respectively. In Han et al. (2020), a combined approach using the Faster R-CNN architecture and the Hypernet method has been implemented to detect marine organisms, whereby the detection speed reaches 17 FPS. Similar detection architectures using data augmentation (DA) technology have also been developed to increase detection accuracy (Huang et al., 2019, Xu et al., 2018). However, the foregoing two-stage detectors take region proposal as an intermediate step, which is extremely time-consuming within the BOD task.
Recently, single-stage detectors with faster detection speed have been extensively developed for BOD, such as OverFeat (Sermanet et al., 2013), G-CNN (Najibi et al., 2016), SSD (Liu et al., 2016) and YOLO (Redmon et al., 2016, Redmon and Farhadi, 2017, Redmon and Farhadi, 2018). By virtue of a depthwise separable convolution strategy, the lightweight SSD-based detector (Ma et al., 2019) has been proposed to detect sea cucumbers, whereby 19.8 FPS detection in CPU mode can be achieved. By deploying a squeeze-and-excitation unit in the feature extracting network, the YOLOv3-UW scheme for detecting dense benthonic organisms has been implemented in Shi et al. (2019), whereby 46 FPS detection can sufficiently meet real-time requirements. It should be highlighted that, due to the lack of a region proposal network for extracting the region of interest, single-stage detectors inevitably lead to higher localization errors than those of two-stage detectors (Lin, Dollár et al., 2017). In order to improve localization performance for the BOD, apart from optimizing the backbone architecture and improving local feature extraction (He et al., 2020, Lin, Goyal et al., 2017), more effective regression loss functions have been widely explored. To be specific, the smooth $L_1$ function (Liu et al., 2016, Ma et al., 2019, Qiu et al., 2019) has been proposed to calculate the regression loss between the ground truth box (GTB) and the predicted bounding box (PBB). However, the smooth $L_1$-based loss requires that the bounding box localization parameters be independent of each other. Moreover, the same smooth $L_1$ loss can lead to completely different intersection over union (IOU) between the GTB and the PBB (Rezatofighi et al., 2019). In Yu et al. (2016), the IOU-based regression loss has been proposed by taking the relevance of bounding box information into consideration.
However, the IOU-based regression loss becomes ineffective in the situation where there does not exist any intersection between the GTB and the PBB, since IOU-based regression loss cannot measure the similarity between the GTB and PBB in non-overlapping cases, thereby resulting in rather poor localization performance.
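The two shortcomings described above can be checked numerically. The following is a minimal sketch, not the authors' code: the box format `(x1, y1, x2, y2)`, the function names and the toy coordinates are all assumptions made for illustration. It shows two PBBs with the same smooth L1 loss but different IOU, and two non-overlapping PBBs at very different distances that both receive an IOU of zero.

```python
def smooth_l1(pred, target, beta=1.0):
    """Sum of smooth-L1 penalties over the four box coordinates."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total


def iou(a, b):
    """IOU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Toy boxes (illustrative values only, not data from the paper).
gtb = (0.0, 0.0, 10.0, 10.0)
pbb_shifted = (2.0, 2.0, 12.0, 12.0)   # every coordinate is off by 2
pbb_shrunk = (2.0, 2.0, 8.0, 8.0)      # every coordinate is off by 2 as well
pbb_near = (11.0, 0.0, 21.0, 10.0)     # disjoint, 1 unit away from the GTB
pbb_far = (100.0, 0.0, 110.0, 10.0)    # disjoint, 90 units away from the GTB

if __name__ == "__main__":
    print(smooth_l1(pbb_shifted, gtb), smooth_l1(pbb_shrunk, gtb))  # equal losses
    print(iou(gtb, pbb_shifted), iou(gtb, pbb_shrunk))              # different IOU
    print(iou(gtb, pbb_near), iou(gtb, pbb_far))                    # both zero
```

Because the IOU is zero for every disjoint pair, an IOU-based loss cannot tell the near miss from the far one, which is the failure mode noted in the text.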
In addition, the shape feature of benthonic organisms has been vaguely accommodated by common anchor boxes in previous works (Liu et al., 2020, Ma et al., 2019, Qiu et al., 2019). The unique width and height scales of benthonic organisms are usually assumed to be the same as generic objects, and thereby there arises significant inconsistency with the ground truth. In essence, within the common datasets including PASCAL VOC, MS COCO and ImageNet (Deng et al., 2009, Everingham et al., 2015, Lin et al., 2014, Russakovsky et al., 2015), the anchor boxes extract rather few a priori information for benthonic organisms. Note that combining appropriate a priori shape dimension with a detection network can significantly increase recall rate of the BOD (Redmon and Farhadi, 2017, Redmon and Farhadi, 2018, Ren et al., 2015).
Furthermore, the number of BOD training samples collected by remotely operated vehicles is far smaller than that of the generic PASCAL VOC, MS COCO and ImageNet datasets (Deng et al., 2009, Everingham et al., 2015, Lin et al., 2014, Russakovsky et al., 2015, Wang, Liu et al., 2020). In this context, on the one hand, a small benthonic organisms training dataset (BOTD) will inevitably lead to over-fitting of the CNN with a huge number of weight parameters (Rezatofighi et al., 2019), thereby resulting in poor detection and recognition performance. On the other hand, limited training samples cannot sufficiently represent changeable and complicated underwater environments including color cast, uneven illumination, blurring, low contrast and changing camera views (Ancuti et al., 2012, Ghani and Isa, 2015, Li et al., 2016), such that various situations of benthonic organisms can hardly be effectively captured.
In this paper, to overcome the foregoing challenges, a novel one-stage CNN detector-based BOD (OSCD-BOD) scheme for the ESSH is established by virtue of the YOLO philosophy (Redmon et al., 2016, Redmon and Farhadi, 2017, Redmon and Farhadi, 2018). Key points can be summarized as follows:
The bounding box regression loss (BBRL) between the GTB and PBB pertaining to the BOD is exclusively defined by using a generalized intersection over union (GIoU), thereby significantly enhancing both detection precision and recall rate simultaneously.
To accelerate training while significantly increasing the recall rate, especially compared to common anchor boxes (CAB), the benthonic organisms anchor boxes (BOAB) are formulated by devising K-means-based dimension clustering on the BOTD.
Using a geometric and color transformations (GCT)-based DA technique, limited marine benthonic organism datasets can be dramatically enlarged to an extended sample set featuring different salt-and-pepper noise levels, light intensities, blur levels and viewpoints, such that over-fitting is effectively avoided and detection sensitivity to changeable and complicated underwater environments is significantly reduced.
Eventually, the proposed OSCD-BOD scheme is established by integrating the GIoU, BOAB and GCT modules, thereby significantly enhancing the BOD ability in terms of accuracy and robustness, which provides an advanced detection solution for the marine biology community.
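To make the first key point concrete, the sketch below computes the GIoU as defined by Rezatofighi et al. (2019): the IoU minus the fraction of the smallest enclosing box that is not covered by the union of the two boxes. Taking the loss as `1 - GIoU` follows that reference; whether the BBRL in this paper has exactly this form is an assumption, as are the box format `(x1, y1, x2, y2)`, the function names and the toy coordinates.

```python
def giou(a, b):
    """Generalized IoU of two boxes (x1, y1, x2, y2) with positive area."""
    # Intersection and union, as for the plain IoU.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Smallest axis-aligned box enclosing both inputs.
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    enclosing = cw * ch
    return inter / union - (enclosing - union) / enclosing


def giou_loss(pbb, gtb):
    """Regression loss in [0, 2): 0 for a perfect match, larger for worse boxes."""
    return 1.0 - giou(pbb, gtb)


# Toy boxes (illustrative values only).
gtb = (0.0, 0.0, 10.0, 10.0)
pbb_near = (11.0, 0.0, 21.0, 10.0)    # disjoint, close to the GTB
pbb_far = (100.0, 0.0, 110.0, 10.0)   # disjoint, far from the GTB

if __name__ == "__main__":
    print(giou_loss(gtb, gtb))        # 0.0 for identical boxes
    print(giou_loss(pbb_near, gtb))   # smaller loss for the near miss
    print(giou_loss(pbb_far, gtb))    # larger loss for the far miss
```

Unlike the plain IoU, the GIoU still ranks the two disjoint predictions, so the loss keeps providing a useful signal when the GTB and PBB do not overlap.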
The rest of this paper is organized as follows. The BOD problem is formulated in Section 2. The OSCD-BOD scheme is proposed in Section 3. Experimental studies and comparisons are given in Section 5. Conclusions are drawn in Section 6.
Section snippets
Problem formulation
The PBB (magenta) and GTB (green) information is shown in Fig. 1. The PBB regression center $(b_x, b_y)$ can be determined by $b_x = \sigma(t_x) + c_x$ and $b_y = \sigma(t_y) + c_y$, where $(t_x, t_y)$ is the predicted raw center coordinate offset with respect to the top-left corner of the current grid cell, $(c_x, c_y)$ denotes the coordinate offset of the current grid cell with respect to the top-left corner of the input image, and $\sigma(\cdot)$ is a logistic activation function for constraining the predicted center coordinate within the range $(0, 1)$.
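The center decoding described above can be sketched in a few lines. This is a hedged illustration in the style of the YOLO family, not the authors' implementation: the names `tx`, `ty` (raw predicted offsets) and `cx`, `cy` (grid cell offsets) are assumed, and coordinates are expressed in grid-cell units.

```python
import math


def sigmoid(x):
    """Logistic activation that squashes a raw offset into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_center(tx, ty, cx, cy):
    """Map raw offsets (tx, ty) inside grid cell (cx, cy) to a box center.

    The logistic function keeps the center inside the cell that predicted it.
    """
    return sigmoid(tx) + cx, sigmoid(ty) + cy


if __name__ == "__main__":
    # A raw offset of zero lands in the middle of cell (3, 5).
    print(decode_center(0.0, 0.0, 3, 5))
```

Constraining the offset this way ties each prediction to its own grid cell, which is the purpose of the logistic activation in the text.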
OSCD-BOD scheme
In this section, the proposed OSCD-BOD scheme is systematically designed by devising four modules, i.e., the CNN-based detection network structure (DNS), the GIoU-based bounding box regression loss, the K-means-based dimension clustering and the GCT-based data augmentation.
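As an illustration of the K-means-based dimension clustering module, the sketch below clusters ground-truth box sizes `(width, height)` using the distance `1 - IoU` introduced for anchor selection in YOLOv2 (Redmon and Farhadi, 2017). Whether this paper uses exactly this distance, the mean as the centroid update, or this initialization is an assumption; the function names and the toy box sizes are hypothetical.

```python
def wh_iou(box, centroid):
    """IoU of two boxes given as (w, h) and aligned at a common corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    return inter / (box[0] * box[1] + centroid[0] * centroid[1] - inter)


def kmeans_anchors(boxes, k, iters=100):
    """Cluster (w, h) pairs into k anchors using the distance 1 - IoU."""
    # Deterministic initialization: k boxes spread evenly over the area ranking.
    ranked = sorted(boxes, key=lambda b: b[0] * b[1])
    n = len(ranked)
    centroids = [ranked[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as smallest 1 - IoU distance.
            best = max(range(k), key=lambda j: wh_iou(box, centroids[j]))
            clusters[best].append(box)
        updated = []
        for j, members in enumerate(clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                updated.append((w, h))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids, key=lambda c: c[0] * c[1])


# Toy box sizes in pixels (illustrative values only, not from the BOTD).
toy_boxes = [(10, 12), (11, 11), (12, 10),
             (48, 52), (50, 50), (52, 48),
             (100, 40), (104, 42), (96, 38)]
anchors = kmeans_anchors(toy_boxes, k=3)

if __name__ == "__main__":
    print(anchors)
```

Using an IoU-based distance rather than the Euclidean one keeps large boxes from dominating the clustering, so the resulting anchors reflect box shape rather than box size alone.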
Evaluation indicator
The mean performance under corruption (mPC) and relative performance under corruption (rPC) are employed to evaluate detector performance in this paper. The mPC is defined as follows: $\mathrm{mPC} = \frac{1}{N_c} \sum_{c=1}^{N_c} \frac{1}{N_s} \sum_{s=1}^{N_s} P_{c,s}$, where $P_{c,s}$ is the detection performance on test images with corruption $c$ under severity level $s$, and $N_c$ and $N_s$ refer to the number of corruptions and severity levels, respectively.
Accordingly, the rPC is defined by $\mathrm{rPC} = \mathrm{mPC} / P_{\mathrm{clean}}$, where $P_{\mathrm{clean}}$ represents the detection performance on
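The two indicators can be computed as in the following sketch. The mean performance under corruption averages the detection performance over severity levels and then over corruption types, as stated above. Taking the relative performance as the ratio of that mean to the performance on uncorrupted test images follows the common robustness-benchmark definition and is an assumption here, since the definition in this snippet is cut off. The function names and all numbers are made up for illustration and are not results from the paper.

```python
def mean_performance_under_corruption(perf):
    """perf[c][s] is the detection performance for corruption c at severity s."""
    per_corruption = [sum(row) / len(row) for row in perf]
    return sum(per_corruption) / len(per_corruption)


def relative_performance_under_corruption(perf, clean_perf):
    """Ratio of the mean corrupted performance to the clean performance."""
    return mean_performance_under_corruption(perf) / clean_perf


# Toy scores: 2 corruption types x 2 severity levels (not real results).
toy_perf = [
    [0.6, 0.4],   # e.g. a noise corruption at severity 1 and 2
    [0.5, 0.3],   # e.g. a blur corruption at severity 1 and 2
]
toy_clean = 0.9

if __name__ == "__main__":
    print(mean_performance_under_corruption(toy_perf))
    print(relative_performance_under_corruption(toy_perf, toy_clean))
```

The relative indicator separates robustness from raw accuracy: two detectors with different clean scores can be compared by how much of that score they retain under corruption.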
Experimental studies and comparisons
In order to demonstrate the effectiveness and superiority of the proposed OSCD-BOD scheme, the SSE-CAB-based DNS conducted on the original dataset (i.e., SSE-CAB-OD-DNS) is implemented together with four other schemes, including (1) the GIoU-CAB-based DNS conducted on the original dataset (i.e., GIoU-CAB-OD-DNS), (2) the SSE-BOAB-based DNS conducted on the original dataset (i.e., SSE-BOAB-OD-DNS), (3) the SSE-CAB-GCT-based DNS conducted on the augmented dataset (i.e., SSE-CAB-GCT-DNS) and (4) the
Conclusion
In this paper, a real-time and accurate detection paradigm for the ESSH under changeable and complicated underwater environments has been efficiently achieved by developing an OSCD-BOD scheme. Specifically, the GIoU technique has been elaborately designed to measure the practical regression loss between the GTB and PBB. Moreover, the BOAB has been formulated for the first time by virtue of K-means-based dimension clustering, thereby significantly improving anchor box determination.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors would like to thank the Editor-in-Chief, the Associate Editor, and anonymous reviewers for their invaluable suggestions and comments.
Funding
This work is supported by the Liaoning Revitalization Talents Program (under Grant XLYC1807013), the Equipment Pre-Research Fund of Key Laboratory (under Grant 6142215200106), and the Fundamental Research Funds for the Central Universities (under Grant 3132019344).
References (55)
- et al. Stereo vision with texture learning for fault-tolerant automatic baling. Computers and Electronics in Agriculture (2011).
- et al. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks (2018).
- et al. Learning dual-margin model for visual tracking. Neural Networks (2021).
- PyDiNet: Pyramid dilated network for medical image segmentation. Neural Networks (2021).
- et al. Faster R-CNN for marine organisms detection and recognition using data augmentation. Neurocomputing (2019).
- et al. Long-term cognitive network-based architecture for multi-label classification. Neural Networks (2021).
- et al. Improved recurrent neural network-based manipulator control with remote center of motion constraints: Experimental results. Neural Networks (2020).
- Ancuti, C., Ancuti, C. O., Haber, T., & Bekaert, P. (2012). Enhancing underwater images and videos by fusion. In...
- et al. Speaker recognition based on deep learning: An overview. Neural Networks (2021).
- Bay, H., Tuytelaars, T., & Van Gool, L. (2006). SURF: Speeded up robust features. In European Conference on Computer...