Abstract
Underwater object detection has numerous applications in protecting, exploring, and exploiting aquatic environments. However, underwater environments pose a unique set of challenges for object detection including variable turbidity, colour casts, and light conditions. These phenomena represent a domain shift and need to be accounted for during design and evaluation of underwater object detection models. Although methods for underwater object detection have been extensively studied, most proposed approaches do not address challenges of domain shift inherent to aquatic environments. In this work we propose a data-centric framework for combating domain shift in underwater object detection with image enhancement. We show that there is a significant gap in accuracy of popular object detectors when tested for their ability to generalize to new aquatic domains. We used our framework to compare 14 image processing and enhancement methods for their efficacy in improving underwater domain generalization using three diverse real-world aquatic datasets and two widely used object detection algorithms. Using an independent test set, our approach surpassed the mean average precision of existing model-centric approaches by 1.7–8.0 percentage points. In summary, the proposed framework demonstrated a significant contribution of image enhancement to underwater domain generalization.
1 Introduction
Aquatic environments occupy 71% of Earth’s surface and provide important ecosystem services [22], are a growing source of food production [17], and help regulate climate [87]. However, monitoring these environments poses significant challenges due to their inaccessibility, scale, and dynamic nature. Stakeholders often employ underwater camera systems—stationary, remotely operated, or autonomous—to gain insights into otherwise inaccessible aquatic environments. The escalating amount of data collected in this way requires autonomous processing using computer vision and machine learning. There are numerous applications for computer vision in environmental monitoring [3, 15, 61, 85], underwater exploration [39, 40], and food production [49, 101, 111]. However, underwater environments present a unique set of challenges including variable turbidity, colour casts, and light conditions. These variable conditions necessitate development of specialised underwater computer vision models.
The success of deep learning in the field of computer vision, especially of convolutional neural networks (CNNs) [42], also gave rise to fast and accurate object detection algorithms such as Faster Region-based CNN (Faster R-CNN) [80] and You Only Look Once (YOLO) [79]. Object detection deals with localizing objects in an image and classifying identified objects into predefined categories (classes) [73]. Many underwater object detection models have been developed to detect different marine organisms by fine-tuning and engineering new layers for Faster R-CNN and YOLO or by introducing advanced data augmentation [30, 64, 65, 70, 83]. Despite their popularity, how well these models respond to domain shifts, that is, their ability to generalize to different water turbidity, colour casts, or light conditions, remains relatively understudied [10, 72].
Datasets for underwater object detection models are often collected as videos which are then split into individual frames (images) for training and evaluation, with the number of extracted images exceeding the number of collected videos by orders of magnitude. The evaluation is commonly performed by randomly splitting the collected images into a training set and a test set, which means that different parts of the same video sequence are used in both training and evaluation. This approach can overestimate detection accuracy because it neglects that water turbidity, colour casts, and light conditions change rapidly across time, locations, and ecological habitats. We argue that to make underwater object detection more useful in applied settings, the evaluation needs to account for the domain shifts inherent to aquatic environments. That is, depending on the application, there should be no overlap between a training set and a test set in terms of video identity, date and location of recording, or ecological habitats.
Many image processing and enhancement techniques, both general [28, 35, 77, 81, 110] and specific to underwater environments [24, 31, 105], have been developed to improve the visual quality of images by adjusting contrast and colour balance, enhancing the dynamic range, removing haze, or adjusting for light attenuation in underwater imagery. It has been suggested that these methods could improve downstream tasks, including underwater object detection [31, 105]. Nonetheless, several studies [34, 59, 100, 103] resulted in no or only small improvements in detection accuracy. These studies tested the effects of image enhancement without considering domain shift. We hypothesise that, by reducing the complexity caused by distinct domain-associated visual features, image processing and enhancement methods could provide greater accuracy improvements when models are evaluated for their ability to generalize to domains outside of the training set. This idea is related to Kolmogorov complexity [8, 41], which defines the complexity of an object as the length of its shortest possible description. From this perspective, image processing and enhancement techniques could reduce data complexity by diminishing the domain-associated visual features. As a result, lower-complexity data may lead to improved learning and inference [37]. To test our hypothesis, we propose a framework for combating domain shift in underwater object detection with image enhancement and evaluate 14 different image processing and enhancement methods across three diverse aquatic datasets and two object detection algorithms. The contributions of this paper can be summarised as follows:
1. We proposed a data-centric framework for evaluating domain shift in underwater object detection with a robust and comprehensive cross-validation approach.
2. The framework identified a considerable disparity in underwater object detection performance depending on whether within-domain or out-of-domain evaluation was used. This disparity has profound implications on how underwater object detectors should be evaluated in practice.
3. To assess the effectiveness of image enhancement for underwater domain generalization, we proposed a measure based on silhouette scores and demonstrated its correlation with out-of-domain detection accuracy.
4. We provided empirical evidence that a fast implementation of MSRCR with a limited kernel size is one of the most effective enhancement methods for combating domain shift, consistently across datasets and detection algorithms.
2 Related work
2.1 Underwater object detection
Because of the unique challenges in aquatic environments, many object detection algorithms have been specifically tailored for detecting underwater objects. The vast majority of underwater object detection models have been derived from the Faster R-CNN [30, 64, 83] and YOLO [26, 58, 60, 62, 63, 65, 70] architectures. Some have been optimised to detect specific object categories such as different marine benthic organisms [62] or jellyfish species [26, 83], while others have been designed to address underwater challenges common across various target categories. To this end, model-centric approaches have been proposed to enhance model architecture. Muksit et al. [70] optimised feature map upsampling to improve detection of small objects such as small fish. Liu et al. [58] and Gao et al. [26] integrated new attention modules to improve detection of blurred objects, a frequent result of forward scatter in underwater image acquisition. Another method inspired by the phenomena of underwater image acquisition introduced a new loss function with a criterion incorporating a physical model of underwater illumination [63]. In contrast to the model-centric approaches, Huang et al. [30] proposed a data-centric approach that introduced data augmentation techniques based on the inverse process of underwater image restoration to simulate different levels of water turbulence, thereby improving training diversity. A common topic in underwater object detection is the use of image enhancement. Most often, image enhancement has been incorporated as an auxiliary regularisation task, in which case the model is trained to restore a corrupted image and detect objects in a joint fashion [11, 23, 60, 102]. Taking a different approach, Dai et al. [18] developed a detection architecture that extracts features from both original and enhanced images using gated fusion, allowing adaptive feature extraction from different sources. Despite these advancements, none of the aforementioned studies evaluated detection accuracy under domain shifts, which is the focus of our work.
Underwater domain generalization remains relatively understudied. Ottaviani et al. [72] performed a longitudinal study at the Western Mediterranean Expandable Seafloor Observatory and observed that detection accuracy of fish progressively decreased over the duration of their study. Liu et al. [57] proposed DG-YOLO, a domain generalization approach, which combined domain alignment with a data augmentation technique called water quality transfer (WQT) to synthetically increase the diversity of training domains. Most recently, Chen et al. [10] proposed Domain Mixup and Contrastive Learning (DMCL) and a dataset, referred to as S-UODAC, of marine benthic organisms with seven synthetic domains. DMCL is a model-centric method, which uses domain sampling and semantic consistency to achieve state-of-the-art results in domain generalization for underwater object detection. We compare the accuracy of our data-centric approach with DMCL on the S-UODAC dataset.
2.2 Underwater image enhancement
Image processing and enhancement methods aim to improve the visual quality of images by adjusting brightness, contrast, colour balance, reducing noise, sharpening edges, and removing artifacts such as haze. In aquatic environments, the main focus of image enhancement is to counteract the selective light attenuation, which causes loss of colour accuracy; the forward scatter, which results in blurring; and the backscatter, which causes a haze-like appearance, reducing clarity and contrast. Underwater image enhancement methods vary in approach: some rely on physical models [2, 32] to correct for light attenuation and scattering [12, 24, 38, 48, 86, 98, 105, 106], while others use deep learning architectures such as CNNs [46], generative adversarial networks (GANs) [14, 31, 45, 51], or transformers [76]. Hybrid methods have also emerged that combine both paradigms, leveraging physical insights to constrain training of deep learning models [21, 47, 75]. Although the deep learning approaches tend to be faster than the physical ones, they have been mostly trained and evaluated using images of a relatively low resolution, typically \(256\times 256\), which limits their applicability to downstream tasks such as underwater object detection.
Several studies evaluated popular image processing and enhancement methods on underwater imagery using various quantitative measures to assess the visual quality of the enhanced images and the effect of image enhancement on downstream tasks such as object detection [9, 34, 59, 100, 103]. There are two key differences between those studies and our work. Firstly, previous studies used a single dataset or several related datasets of limited diversity. Specifically, the studies focused on detecting only five categories of benthic marine organisms (sea urchins, sea cucumbers, starfish, scallops, and seaweeds) and used relatively small test sets of 300–1,111 images. With limited diversity and relatively small test sets, there was insufficient evidence that the conclusions were transferable to other aquatic environments and organisms. Secondly, except for Chen et al. [9], previous studies did not evaluate domain generalization, which is the focus of our proposed framework. That is, the previous studies used random splitting to create the training and test sets and, therefore, did not account for domain shifts due to variable turbidity, colour casts, and light conditions. Chen et al. [9] compared two image restoration methods in the context of a synthetically created domain shift. However, their design was circular because enhancement methods were applied to restore synthetically manipulated images rather than images from real-world domains. Additionally, their domain generalization evaluation was not quantitative and consisted of selected example images in which image restoration improved detections. In contrast to the previous studies, we propose a framework to test domain generalization performance of image processing and enhancement methods in the context of underwater object detection using a robust and comprehensive cross-validation approach applied to three diverse, real-world, aquatic datasets.
2.3 Domain generalization
Domain generalization deals with the design and evaluation of machine learning models that can generalize across different but related domains, which were not encountered during training. Domain generalization thus ensures that models perform robustly in real-world scenarios susceptible to domain shifts (variable data distributions), such as those caused by changes in environmental conditions. Different approaches have been proposed for domain generalization of computer vision models [93, 109]. Most of these approaches deal with image classification (object recognition) and introduce techniques based on domain alignment, data augmentation, auxiliary learning tasks, meta-learning, and optimisation objectives. Domain alignment aims to learn domain-invariant features that remain consistent across different domains [25, 50, 66, 69, 89, 90]. Data augmentation works towards the same goal by increasing the diversity of training domains by synthetic manipulation of the inputs [52, 88, 91, 108]. The inclusion of auxiliary learning tasks is a type of regularisation that encourages the model to learn more abstract and thus generalizable representations [7, 95]. Meta-learning represents a distinct approach based on the learning-to-learn framework, which adopts an episodic training paradigm, i.e. the training domains are split into meta-train and meta-test sets at each iteration to simulate domain shift [4, 20, 33]. It has been shown that domain generalization can also be improved by introducing new optimisation objectives such as the sharpness-aware gradient matching [78, 94]. Additionally, some image classification methods focus on a specific case of unsupervised domain generalization [27, 107], in which the majority of source (training) domains contain unlabelled data.
Much less attention has been paid to domain generalization for object detection (however, some of the approaches for image classification can be applied to object detection). Lin et al. [54] proposed a domain-invariant disentanglement network, a type of domain alignment approach, which works on both global- and instance-level representations for generalizable object detection. Similarly, Wu and Deng [97] employed both global- and instance-level contrastive loss and introduced cyclic-disentanglement capable of generalizing to unseen target domains after being trained on a single source domain. Lee et al. [44] combined object-aware data augmentations and object-aware contrastive loss for domain alignment.
The majority of the domain generalization methods, aside from the data augmentation techniques, are model-centric approaches, which modify the model architecture or training to improve generalization. In contrast, our framework offers a straightforward data-centric approach. The simplicity of pre-processing inputs with image enhancement can be seamlessly integrated with off-the-shelf object detectors, making it both versatile and easy to implement across various detection tasks.
3 Methodology
We designed a data-centric framework (Fig. 1) for testing domain generalization in underwater object detection with a robust and comprehensive cross-validation approach. To demonstrate the usefulness of the proposed framework, we used it to evaluate 14 image processing and enhancement methods. The framework allowed us to identify image enhancement methods that could robustly and consistently deliver improvements in terms of detection accuracy across several datasets.
3.1 Proposed framework
Our framework focuses on delivering replicable, robust, application-orientated results (Fig. 1). Therefore, the framework requires a minimum of two distinct datasets. The first dataset, referred to as the discovery dataset, is used for design, optimisation, and model selection, while the second dataset, referred to as the replication dataset, is used for an independent assessment to test if the selected method’s performance can be reproduced. More than one dataset can be used in both the discovery stage and the replication stage.
Domain shift is inherent to aquatic environments [19, 72]. Therefore, we propose to always split datasets into subsets (such as training and test sets or cross-validation folds) based on domains rather than individual images. That is, domain-based splitting guarantees that all images from a given domain are entirely contained within a single data subset.
Given that most underwater object detection datasets typically contain fewer than 10,000 images [56, 85], our framework utilises a cross-validation approach. Cross-validation splits images of the given dataset into a chosen number of non-overlapping groups (folds). Each fold is used for testing once, while the other folds are concatenated and used for training. This comprehensive approach thus uses every single image to estimate the detection performance without compromising the training procedure. The framework further splits the union of folds dedicated to training into training and validation subsets (again using domain-based splitting). The validation set can be used for hyperparameter tuning, early stopping and model selection.
Considering the variation in the quantity of available images across different domains, the domain-based splitting may result in non-uniformly sized cross-validation folds, which in turn may result in substantial variability of the estimated detection accuracy. Therefore, in our framework, each cross-validation procedure is repeated three times with different random seeds controlling both cross-validation splits and neural network parameter initialisation. Finally, the detection accuracy is evaluated by concatenating detections from all test folds, and the mean and standard deviation across the three random seeds are used to select the best method. This concludes the discovery stage of the framework. In the replication stage, three randomly seeded replicates of domain-based cross-validation are used to train and evaluate the performance of the best method on the replication datasets.
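The domain-based splitting described above can be implemented, for example, with scikit-learn's group-aware splitters. The sketch below is a minimal illustration only; the function and variable names (e.g. domain_based_folds, image_ids, domains) are ours and not part of any released code, and the actual pipeline additionally repeats the whole procedure with three random seeds.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def domain_based_folds(image_ids, domains, n_folds=3, seed=0):
    """Yield (train, val, test) image-id arrays such that all images from a
    given domain fall entirely within a single subset."""
    image_ids, domains = np.asarray(image_ids), np.asarray(domains)
    rng = np.random.default_rng(seed)
    unique_domains = rng.permutation(np.unique(domains))
    # Randomly assign whole domains to the cross-validation folds.
    fold_of_domain = {d: i % n_folds for i, d in enumerate(unique_domains)}
    fold_labels = np.array([fold_of_domain[d] for d in domains])

    for test_fold in range(n_folds):
        test_mask = fold_labels == test_fold
        train_val_idx = np.where(~test_mask)[0]
        # Carve a validation subset out of the training folds, again by domain.
        splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=seed)
        tr, va = next(splitter.split(train_val_idx, groups=domains[train_val_idx]))
        yield (image_ids[train_val_idx[tr]],
               image_ids[train_val_idx[va]],
               image_ids[test_mask])
```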
3.2 Datasets
Three publicly available underwater object detection datasets were used: DeepFish, Moreton Bay Environmental Education Centre Low Visibility dataset (MBEEC-Low-Vis), and Jellytoring (Table 1). DeepFish was used as the discovery dataset, and MBEEC-Low-Vis and Jellytoring were used as the replication datasets (Fig. 1). Additionally, we used the Synthetic Underwater Object Detection Algorithm Contest (S-UODAC) dataset to compare with previously reported results of existing approaches for underwater domain generalization.
We used the extended version [70] of the original DeepFish dataset [84], which contains 4,505 images with only one category referred to as “fish” (a catch-all label for multiple different species). Of the four datasets, DeepFish contains the most diverse habitats—20 in total—including mangroves, boulders, coral reef, and seagrass. The habitats were used as domains in our experiments.
The full MBEEC-Low-Vis dataset, which contains 19,041 images with 19 categories of fish species, was filtered to retain the six species that had more than 5,000 annotations, reducing the number of images to 16,540. The six species were Australasian snapper, eastern striped grunter, paradise threadfin bream, smallmouth scad, smooth golden toadfish, and yellowfin bream/tarwhine. Bounding box annotations for fish which did not belong to one of the six species were removed. Given that the original dataset was extracted from videos with five images per second, the redundancy was reduced by subsampling to a single image per second. Thus, the final dataset contained 3,327 images. The images were assigned to 24 domains based on the location and date of the video recording. While this definition of a domain is imperfect, examining image clusters in the space of the data’s first two principal components confirmed that the variability across the locations and dates of these recordings was sufficient.
The Jellytoring 2.0 dataset [83] contains 2,886 images with 15 jellyfish species, including Aurelia aurita, Pelagia noctiluca, Chrysaora hysoscella, and Cotylorhiza tuberculata. Most of this dataset was compiled from publicly available video recordings, and thus the only information that could be used for assigning domains to images was the video recording identity, which resulted in 188 domains.
The DeepFish, MBEEC-Low-Vis, and Jellytoring datasets were split into training, validation, and test splits using a three-fold cross-validation with domain-based splitting in three randomly seeded replicates (see Section 3.1 for details). The training, validation, and test split proportions ranged from 40%–64%, 10%–24%, and 26%–40%, respectively, and the numbers of domains in the training splits ranged from 7–10, 6–15, and 72–90 for DeepFish, MBEEC-Low-Vis, and Jellytoring, respectively. Additionally, to empirically demonstrate the need to use domain-based splitting in the proposed framework, we also split DeepFish, MBEEC-Low-Vis, and Jellytoring into cross-validation folds using random splitting, i.e. images were randomly split into cross-validation folds without considering the domains associated with the images. With random splitting, the training, validation, and test split proportions ranged from 46%–47%, 20%–21%, and 33%–34%, respectively, and all domains were included in the training splits for DeepFish and MBEEC-Low-Vis. For Jellytoring, the number of training domains ranged from 180–185.
To compare our approach with model-centric methods for domain generalization and underwater object detection, we used the S-UODAC dataset [10], which was compiled by splitting the Underwater Object Detection Algorithm Contest (UODAC) 2020 dataset into seven equal parts and employing neural style transfer to synthetically manipulate domain information in each part. S-UODAC comprises seven domains and 5,454 images of four different marine organisms: echinoids (sea urchins), starfish, holothurian (sea cucumber), and scallop.
3.3 Evaluation metrics
Mean average precision (mAP), calculated according to the COCO evaluation standard, was used as the detection accuracy metric. Precision (P) represents the fraction of correct detections, and recall (R) represents the fraction of objects detected:

\[ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \]

where TP, FP, and FN are the number of true positives (correct detections), false positives (detections where there was no object), and false negatives (missed detections), respectively. A correct detection was defined for a given intersection over union (IoU) threshold, which is the ratio of the area of the intersection of the ground truth and predicted bounding boxes to the area of their union.
The average precision (AP) for a given object category (class) k was interpolated across 101 recall values (ranging from 0 to 1 in 0.01 increments):

\[ \text {AP}_k = \frac{1}{101} \sum _{\tilde{r} \in \{0, 0.01, \ldots , 1\}} p(\tilde{r}), \]

where \(p(\tilde{r})\) is the precision value at the recall value \(\tilde{r}\). Finally, mAP was the mean of AP values across all object categories (C) in the given dataset:

\[ \text {mAP} = \frac{1}{C} \sum _{k=1}^{C} \text {AP}_k. \]
We used mAP at the IoU threshold of 0.5 (mAP\(_{50}\)) as the main evaluation metric. Additionally, we used mAP\(_{50-95}\) for implementing early stopping and model selection, which was calculated as the mean of mAPs at IoU thresholds ranging from 0.50–0.95 in 0.05 increments. When comparing image enhancement and processing methods to a baseline of using raw images with no processing, we defined \(\Delta \text {mAP}_{50} = \text {mAP}_{50}^{\text {enhanced}} - \text {mAP}_{50}^{\text {raw}}\).
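For reference, these COCO-style metrics can be computed with the Pycocotools package listed in Section 3.4. The snippet below is a sketch with placeholder file names; in the COCOeval summary, stats[1] corresponds to mAP\(_{50}\) and stats[0] to mAP\(_{50-95}\).

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations_test_fold.json")             # ground-truth annotations
coco_dt = coco_gt.loadRes("detections_test_fold.json")   # model detections

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

map_50_95, map_50 = evaluator.stats[0], evaluator.stats[1]
print(f"mAP_50-95: {map_50_95:.3f}, mAP_50: {map_50:.3f}")
```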
3.4 Object detection algorithms and training
We used two object detection algorithms: Faster R-CNN [80] and YOLO [79]. Faster R-CNN is a two-stage model, which first identifies potential objects in an image, and second, classifies the identified objects into one of the predefined labels (including a ‘not-an-object’ label). In contrast, YOLO is a one-stage model, which treats detection as a regression problem of predicting bounding boxes and their corresponding labels. We used the Faster R-CNN implementation provided by the Detectron2 package [99] employing the ResNet-50 with a feature pyramid network architecture (faster_rcnn_R_50_FPN_3x) containing 42 million parameters. For YOLO, the implementation referred to as YOLOv8 from the Ultralytics package [36], specifically the large architecture (YOLOv8l) containing 43.7 million parameters, was used. For both Faster R-CNN and YOLOv8, model weights pretrained on the COCO dataset [55] were used to initialise the training.
Batch sizes were set to 4 and 5 images for Faster R-CNN and YOLOv8, respectively. For both methods, we set the detection confidence to 0, maximum number of detections per image to 100 (more than the maximum number of objects, 66, in any of our datasets), and the non-maximum suppression (NMS) threshold to 0.7. The NMS threshold of 0.7 is the default value in YOLOv8 as well as in Detectron2’s implementation of the Faster R-CNN’s region-proposal network. We also experimented with an NMS threshold of 0.5 and saw only minimal mAP\(_{50}\) improvements of 0.7 and 0.6 percentage points for Faster R-CNN and YOLOv8, respectively, when evaluated using the discovery (DeepFish) dataset. All other hyper-parameters, unless stated otherwise, were kept unchanged for both Faster R-CNN and YOLOv8 (including input image sizes of a maximum \(1333 \times 800\) and \(640 \times 640\) pixels, respectively) to evaluate the effectiveness of the tested image enhancement methods without additional fine-tuning of the underlying object detection architectures.
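For illustration, the sketch below shows how these settings map onto Detectron2 and Ultralytics configuration options; the dataset YAML file name is a placeholder and all other training hyper-parameters are left at their package defaults, so this is indicative rather than a verbatim reproduction of our training scripts.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from ultralytics import YOLO

# Faster R-CNN (Detectron2) with COCO-pretrained weights.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.SOLVER.IMS_PER_BATCH = 4                   # batch size
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.0    # detection confidence threshold
cfg.TEST.DETECTIONS_PER_IMAGE = 100            # maximum detections per image
cfg.MODEL.RPN.NMS_THRESH = 0.7                 # NMS threshold of the region-proposal network

# YOLOv8 (Ultralytics) with COCO-pretrained weights.
model = YOLO("yolov8l.pt")
model.train(data="deepfish_fold0.yaml", epochs=100, batch=5, imgsz=640)
```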
Two different model selection procedures were used to avoid over-fitting to the source (training) domains. For Faster R-CNN, models were trained for a maximum of 100 epochs, calculating mAP\(_{50-95}\) on the validation set at the end of every epoch. If mAP\(_{50-95}\) did not improve for 10 consecutive epochs, the training was terminated and the model with the best mAP\(_{50-95}\) was selected. As a result of this procedure, training of Faster R-CNN models ranged from 8–20 epochs on average, depending on the dataset. For YOLOv8, the model selection strategy implemented in the Ultralytics package was used, which trains the model for the full 100 epochs, stores checkpoints after every epoch, and then selects the model with the best fitness (defined as a weighted sum of mAP\(_{50}\) and mAP\(_{50-95}\) with 0.1 and 0.9 weights, respectively). YOLOv8 models’ training ranged from 55–78 epochs on average, depending on the dataset.
A different training regime was employed to compare our approach with the previously reported results of eight model-centric domain generalization approaches on the S-UODAC dataset [10]. We followed the Chen et al. [10] procedure as closely as possible by training Faster R-CNN for 12 epochs (no model selection procedure) with a stochastic gradient descent optimiser with a learning rate of 0.02, momentum of 0.9, and weight decay of 0.0001. The learning rate was decayed after the first 10 epochs by a factor of 10. We used a batch size of 4, NMS threshold of 0.5, detection threshold of 0.05, and input image size of a maximum \(1333 \times 800\) pixels. As in Chen et al. [10], we trained YOLOv3 (not YOLOv8) for 100 epochs (no model selection procedure) using the Adam optimiser with a learning rate of 0.001 and momentum of 0.9. Multi-scale training was enabled. The batch size of 8, NMS threshold of 0.5, detection threshold of 0.02, and input image size of a maximum \(416 \times 416\) were used.
Detection models were trained on NVIDIA A100 graphical processing units (GPUs), always using the same reproducible environment with Python 3.9.18, Pytorch 1.13.0, CUDA 11.6, Ultralytics 8.0.171, Detectron2 (commit fc9c33b1f6e5d4c37bbb46dde19af41afc1ddb2a), NumPy 1.25.2, Pycocotools 2.0.7, Scikit-learn 1.1.3, Scikit-image 0.19.3, OpenCV 4.8.0.76, and Pillow 9.4.0.
3.5 Image processing and enhancement methods
We selected 14 image processing and image enhancement methods (Table 2 and Fig. 13 in Appendix A) and tested each method individually by comparing object detection performance when the processing or enhancement was applied and when no processing or enhancement was applied. The image processing and enhancement methods were selected based on their implementation availability [5, 13, 92] or having been highlighted in studies on underwater object detection [1, 18, 23, 26, 59, 100, 104]; however, those studies did not evaluate their suitability for improving domain generalization, which is the focus of our work.
The 14 evaluated methods comprised general-purpose and underwater-specific approaches of different levels of complexity. Four methods performed global intensity adjustment (gamma up/down, adjust log, adjust sigmoid, and auto-contrast), focusing on pixel-wise transformations without considering local features or spatial information. Conversely, contrast limited adaptive histogram equalisation (CLAHE) [77] is a local contrast enhancement method, which aims to prevent amplification of noise. Unsharp mask sharpens images by increasing contrast around edges. Grey world and automatic colour equalization (ACE) [81] focus on adjusting white balance and colour distribution. Grey world assumes that the average colour in an image should be neutral grey, and balances colours accordingly. ACE combines global and local colour correction based on both grey-world and white-patch assumptions, allowing more flexibility in colour balancing than grey world by incorporating local effects. Dark channel prior (DCP) [28] and colour attenuation prior (CAP) [110] were developed originally for haze removal in terrestrial images, but they were highlighted in a previous study for their positive effects on improving underwater image quality [59]. Multi-scale retinex with colour restoration (MSRCR) [35] balances dynamic range compression by processing the image at multiple scales, followed by colour restoration to avoid greying. Finally, we included three underwater-specific image enhancement methods: automatic red channel restoration (ARCR) [24], minimal colour loss and locally adaptive contrast enhancement (MLLE) [105], and fully unsupervised image enhancement generative adversarial network (FUnIE-GAN) [31]. ARCR corrects the red channel, compensating for the attenuation of red light in water. MLLE adapts to local regions while minimizing colour loss, considering specific features of underwater images. FUnIE-GAN is a deep learning-based approach that uses a generative adversarial network (GAN) to restore underwater images. We used the model trained in a paired fashion, where the generative model learns to map underwater images (\(256 \times 256\) pixels) of low perceptual quality to their high-quality counterparts by minimizing the difference between the generated and ground-truth images.
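To make the simpler methods concrete, the sketch below shows illustrative implementations of auto-contrast, CLAHE, and grey world using Pillow, OpenCV, and NumPy. Parameter values are indicative only and the implementations we actually evaluated [5, 13, 92] may differ in detail.

```python
import cv2
import numpy as np
from PIL import Image, ImageOps

def auto_contrast(img_bgr):
    """Per-channel linear rescaling of pixel intensities."""
    pil = Image.fromarray(cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB))
    out = ImageOps.autocontrast(pil)
    return cv2.cvtColor(np.array(out), cv2.COLOR_RGB2BGR)

def clahe(img_bgr, clip_limit=2.0, tile_grid_size=(8, 8)):
    """Contrast limited adaptive histogram equalisation on the L channel."""
    lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    l = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size).apply(l)
    return cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)

def grey_world(img_bgr):
    """Scale each channel so that its mean matches the overall grey mean."""
    img = img_bgr.astype(np.float32)
    channel_means = img.reshape(-1, 3).mean(axis=0)
    img *= channel_means.mean() / channel_means
    return np.clip(img, 0, 255).astype(np.uint8)
```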
3.5.1 MSRCR implementation
MSRCR addresses dynamic range compression and colour rendition based on the retinex theory of how human colour vision maintains colour constancy across varying illumination conditions [43]. It operates by decomposing an image at multiple scales, typically three, to capture fine, medium, and coarse details. These decompositions are then combined to produce an output that enhances local contrast and tonal rendition while preserving the overall structure of the image. To ensure that the enhanced images maintain natural and vivid colours, MSRCR involves a colour restoration step that prevents a greying effect often encountered with retinex algorithms.
The multi-scale decomposition includes three passes of the Gaussian blur filter at different scales determined by the standard deviation of the Gaussian distribution (\(\sigma \)) and the kernel size, which defines the size of the convolution filter (in pixels) used to perform the blurring operation. MSRCR is configured by choosing the \(\sigma \) values, while the respective kernel sizes (k) are typically determined using a rule-of-thumb formula, such as OpenCV’s createGaussianKernels function, which sets \(k = 8 \sigma + 1\).
We first implemented MSRCR using OpenCV’s implementation of the Gaussian blur filter, but this approach was prohibitively slow, averaging 3,900 milliseconds (ms) per image (\(1333 \times 750\) pixels, Apple M2 chip). To address this inefficiency, we implemented MSRCR using NVIDIA’s DALI library [71], which offers a GPU-accelerated implementation of the Gaussian filter. However, DALI’s implementation limits the kernel size to a maximum of 169 pixels. While this is sufficient to capture fine (local) details, it is too small for medium and coarse (global) details needed for implementing MSRCR. To circumvent the kernel size limitation, we adopted the following approach: for any kernel size \(k > 169\), the image was resized by a factor of \(\frac{169}{k}\), Gaussian filter with \(\sigma = 21\) and \(k = 169\) was applied to this low resolution image, and the result was resized back to the original resolution. Formally, for an intensity value I(x, y) at pixel position (x, y), MSRCR is the product of retinex operation R and colour correction C:
\[ \text {MSRCR}(x, y) = g \left[ C(x, y) \, R(x, y) + b \right] , \]
\[ R(x, y) = \sum _{i=1}^{n} w_i \left[ \log I(x, y) - \log \left( F_{\sigma _i} * I \right) (x, y) \right] , \]
\[ C(x, y) = \beta \left[ \log \left( \alpha \, I(x, y) \right) - \log \sum _{c} I_c(x, y) \right] , \]

where \(g=5\), \(b=25\), \(\alpha =125\), \(\beta =46\), \(n=3\), \(w_1=w_2=w_3=\frac{1}{3}\), \(\sigma _1 = 15\), \(\sigma _2 = 80\), \(\sigma _3 = 250\), \(k_i = 8 \sigma _i + 1\), \(F_{\sigma _i}\) is the Gaussian filter (with \(F_{21}\) applied to the downscaled image whenever \(k_i > 169\), as described above), \(*\) denotes convolution, \(\sum _{c} I_c(x, y)\) is the sum of intensities across the three colour channels, and \(I_{\text {scaled}} \left( \hat{x}, \hat{y} \right) \) is the pixel intensity at the corresponding pixel position \((\hat{x}, \hat{y})\) after downscaling the image by a factor of \(\frac{169}{k_i}\). This approach worked well both qualitatively and quantitatively due to the strong blurring effect of the Gaussian filter for the \(\sigma \) values commonly used for MSRCR.
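For illustration, the following CPU sketch mirrors the approach described above using OpenCV's Gaussian blur with the resize workaround for kernel sizes above 169 pixels. It is not the GPU (DALI) implementation used in our experiments, and the final per-channel rescaling to the displayable range is a simplification.

```python
import cv2
import numpy as np

SIGMAS = (15, 80, 250)
WEIGHTS = (1 / 3, 1 / 3, 1 / 3)
G, B_OFFSET, ALPHA, BETA = 5, 25, 125, 46
MAX_KERNEL = 169
EPS = 1.0  # avoid log(0)

def _blur(img, sigma):
    k = int(8 * sigma + 1)
    if k <= MAX_KERNEL:
        return cv2.GaussianBlur(img, (k, k), sigma)
    # Kernel too large: blur a downscaled copy and resize back.
    h, w = img.shape[:2]
    scale = MAX_KERNEL / k
    small = cv2.resize(img, (max(1, int(w * scale)), max(1, int(h * scale))))
    small = cv2.GaussianBlur(small, (MAX_KERNEL, MAX_KERNEL), 21)
    return cv2.resize(small, (w, h))

def msrcr(img_bgr):
    img = img_bgr.astype(np.float32) + EPS
    # Multi-scale retinex: weighted sum of log-ratios at each scale.
    retinex = sum(w * (np.log(img) - np.log(_blur(img, s) + EPS))
                  for w, s in zip(WEIGHTS, SIGMAS))
    # Colour restoration to counteract the greying effect.
    colour = BETA * (np.log(ALPHA * img) - np.log(img.sum(axis=2, keepdims=True)))
    out = G * (colour * retinex + B_OFFSET)
    # Simplified per-channel rescaling to the displayable [0, 255] range.
    out -= out.min(axis=(0, 1), keepdims=True)
    out /= out.max(axis=(0, 1), keepdims=True) + 1e-6
    return (out * 255).astype(np.uint8)
```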
We adopted the default MSRCR configuration with \(\sigma \) values of 15, 80, and 250 [35] and compared it with other configurations of MSRCR by adjusting the default \(\sigma \) values by a factor of \(\frac{1}{3}\), \(\frac{1}{2}\), and 2. Furthermore, we performed an ablation study of MSRCR by reducing the number of scales used in the default configuration to a single scale with \(\sigma \) values of 15, 80, and 250, and two scales with \(\sigma \) values of 15 and 80, 15 and 250, and 80 and 250. For the single-scale experiments, no colour correction was performed as we deemed it unnecessary after experimenting with a few images.
3.5.2 Estimation of image processing times
To estimate the delay caused by employing an image pre-processing step prior to inference with an object detection model, we measured the mean processing time of each individual image enhancement method over 100 DeepFish images (the dataset with the highest resolution in our study) at a resolution of \(1333 \times 750\), which is the default input size of the Faster R-CNN model for this dataset. All methods, except for the GPU-accelerated implementations of MSRCR and FUnIE-GAN, were tested on an Apple M2 chip. GPU-accelerated implementations of MSRCR and FUnIE-GAN were tested on an NVIDIA A100.
3.6 Image quality measures
The underwater image quality measure (UIQM) [74] was used to evaluate the effectiveness of image enhancement methods in improving perceived image quality. UIQM is a weighted sum of three components: the underwater image contrast measure (UIConM), the underwater image colourfulness measure (UICM), and the underwater image sharpness measure (UISM), each reflecting different aspects of visual quality in underwater environments. UIConM, UICM, and UISM were designed to correlate with human perception of visual quality. Additionally, we used three general, not underwater-specific, measures of image quality: the naturalness image quality evaluator (NIQE) [67], the structural similarity index measure (SSIM) [96], and the peak signal-to-noise ratio (PSNR). PSNR quantifies the error between the original and the enhanced images and is given by

\[ \text {PSNR} = 10 \log _{10} \left( \frac{M^2}{\text {MSE}} \right) , \]

where M is the maximum possible pixel value of the image (in our case 255), and MSE is the pixel-wise mean squared error between the original and the enhanced images.
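PSNR and SSIM can be computed, for example, with scikit-image (the version listed in Section 3.4); the snippet below is a sketch with placeholder file names.

```python
import cv2
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

original = cv2.imread("raw_frame.jpg")
enhanced = cv2.imread("msrcr_frame.jpg")

psnr = peak_signal_noise_ratio(original, enhanced, data_range=255)
ssim = structural_similarity(original, enhanced, channel_axis=2, data_range=255)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.3f}")
```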
3.7 Silhouette analysis
We developed a measure based on silhouette analysis [82], a clustering quality assessment technique, to quantify the effect of the image processing and enhancement methods on domain-associated visual features. First, we resized all images to \(256 \times 256\) resolution (65,536 pixels) and then further reduced the dimensionality using the principal component analysis (PCA) into two dimensions (we repeated the analysis also with five and ten dimensions, observing similar results). Next, we quantified how “intermixed” the images (represented with the principal components) from distinct domains were using silhouette scores, a statistic ranging from \(-1\) to 1. A score of 1 represents perfectly separated clusters, a score close to 0 represents overlapping clusters, and a score of \(-1\) represents samples occurring in unrelated clusters. We defined the negative mean silhouette score (NMSS) as the negated mean of silhouette scores over the entire dataset:

\[ \text {NMSS} = - \frac{1}{N} \sum _{i=1}^{N} \frac{b(i) - a(i)}{\max \left\{ a(i), b(i) \right\} }, \]

where N is the number of data points in the dataset, D(i) is the domain to which the data point i belongs, a(i) is the mean distance between i and all other data points of D(i), and b(i) is the mean distance between i and all data points of a neighbouring domain (i.e. domain with the smallest mean distance to i). Thus, by treating domain information as cluster membership, an increased NMSS (relative to raw images without any processing) indicates that the image enhancement method diminished some of the domain-specific visual features.
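A minimal sketch of the NMSS computation using scikit-learn is shown below, with domain labels treated as cluster assignments; variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def nmss(images, domains, n_components=2):
    """images: array of shape (N, 256, 256[, 3]); domains: array of N domain labels."""
    flat = np.asarray(images, dtype=np.float32).reshape(len(images), -1)
    components = PCA(n_components=n_components).fit_transform(flat)
    # silhouette_score returns the mean silhouette over all samples;
    # negate it so that more intermixed domains give a higher NMSS.
    return -silhouette_score(components, np.asarray(domains))
```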
4 Results and discussion
We first tested how well off-the-shelf object detectors work in an out-of-domain underwater object detection setting (Section 4.1). Next, we used the proposed framework to evaluate 14 image processing and enhancement methods in mitigating the drop in the detection accuracy caused by the domain shift (Section 4.2) and analysed which aspects of image enhancement contribute to this improvement (Section 4.3). Finally, the best performing image enhancement method was subjected to parameter sensitivity and computational scaling analyses (Section 4.4) and compared with model-centric domain generalization approaches using an independent test set (Section 4.5).
4.1 Comparison of off-the-shelf object detectors using random and domain-based splitting evaluation
We found striking differences in detection accuracy depending on whether random or domain-based splitting was used (Fig. 2). Under the random splitting evaluation, both Faster R-CNN and YOLOv8 detection algorithms yielded high mAP\(_{50}\) values of 96.3%–96.4%, 86.2%–87.3%, and 90.3%–91.4% for DeepFish, MBEEC-Low-Vis, and Jellytoring datasets, respectively. In contrast, the domain-based splitting evaluation yielded mAP\(_{50}\) values below 53%. Faster R-CNN and YOLOv8 performed comparably to each other with mAP\(_{50}\) values of 41.6%–46.8%, 46.9%–52.6%, and 40.3%–43.1% for DeepFish, MBEEC-Low-Vis, and Jellytoring, respectively.
Under the domain-based splitting evaluation, AP\(_{50}\) values across the different object categories (classes) ranged from 26.9%–78.1% and 0.0%–89.7% for MBEEC-Low-Vis and Jellytoring (the two datasets with multiple object categories), respectively. The detection accuracies of Faster R-CNN and YOLOv8 were consistent across the different fish and jellyfish species with Pearson correlation coefficients (r) of 0.66 and 0.99 for MBEEC-Low-Vis and Jellytoring, respectively.
The variability in the number of training images and domains in the training folds of the cross-validation procedure was relatively small with the number of images ranging from 1,815–2,441, 1,360–2,118, and 1,166–1,452 and the number of domains ranging from 7–10, 6–15, and 72–90 for DeepFish, MBEEC-Low-Vis, and Jellytoring, respectively. This level of variability did not result in a detectable association with the detection accuracy (Fig. 14 in Appendix A). Thus, we subjected the DeepFish dataset to subsampling experiments by reducing the number of training images to 400. When the number of training domains was not reduced during subsampling, the difference in mAP\(_{50}\) was limited to 1.0 and 1.4 percentage points (p.p.) for Faster R-CNN and YOLOv8, respectively. However, when the number of training domains was limited to 4 and 2 (down from 9) on average, mAP\(_{50}\) decreased by 5.4 (8.8) and 10.6 (15.2) p.p. for Faster R-CNN (YOLOv8), respectively (Fig. 3). This highlights the benefit of including many diverse domains in the training set.
Despite defining domains simply as location-date or habitat of the video recordings, the influence of domain-associated visual features was observable by projecting the images into the space of the first two principal components (Fig. 4a–c). Furthermore, both detection algorithms achieved high accuracy using the random splitting evaluation, in which visual features of all domains had been observed, and could have been learned, during training. This is in stark contrast to the domain-based splitting evaluation, in which the accuracy plummeted because, at test time, the object detectors encountered visual features that had not been observed during training—an effect that was further pronounced when the number of training domains was subsampled. Therefore, the disparity in detection accuracy between within-domain and out-of-domain evaluation can be attributed to the inability of off-the-shelf detectors to respond to domain shift. This highlights that the assumption about image independence between training and test data under the random splitting evaluation is unreasonable in the case of underwater object detection due to strong visual effects of environmental conditions in aquatic environments.
4.2 Evaluation of image processing and enhancement methods for underwater domain generalization
Five image enhancement methods (MSRCR, MLLE, grey world, ARCR, CLAHE) improved mAP\(_{50}\) of Faster R-CNN on the discovery dataset (DeepFish) by more than 3 p.p. compared to the baseline performance (46.8%) when raw images with no processing were used (Table 3). The best performing method—MSRCR (mAP\(_{50}\) 52.3%)—yielded an improvement of 5.5 p.p. (Figs. 5 and 15 in Appendix A).
When evaluated on the replication datasets, MSRCR delivered consistently better mAP\(_{50}\) values (57.7% and 46.3%) than raw images with no processing (52.6% and 43.1%) for both MBEEC-Low-Vis and Jellytoring, respectively (Table 3). These results constituted respective improvements of 5.1 and 3.2 p.p. Except for CLAHE, the other methods that improved mAP\(_{50}\) on DeepFish also performed well on the replication datasets with MLLE and auto-contrast yielding the second highest \(\Delta \text {mAP}_{50}\) values of 5.0 and 3.1 p.p. on MBEEC-Low-Vis and Jellytoring, respectively. These results demonstrate that the proposed framework identified multiple methods that can consistently aid domain generalization in underwater object detection.
To further validate the reproducibility of these results, we tested the ability of the image enhancement methods to aid underwater domain generalization also with YOLOv8. On the discovery dataset, the top methods identified with Faster R-CNN all delivered mAP\(_{50}\) improvements (compared to raw images with no processing) of more than 6 p.p. when coupled with YOLOv8 (Table 4). This result was reproducible also across the two replication datasets with improvements of more than 1.5 and 2.0 p.p. for MBEEC-Low-Vis and Jellytoring, respectively. MSRCR was again the best performing image enhancement method on the discovery as well as the two replication datasets. MSRCR yielded mAP\(_{50}\) values of 55.1%, 55.7%, and 45.6% for DeepFish, MBEEC-Low-Vis, and Jellytoring, respectively, which constituted improvements of 13.4, 8.8, and 5.3 p.p. compared to raw images with no processing (Fig. 5 and Fig. 16 in Appendix A).
Most methods performed consistently across the three datasets and the two detection algorithms (Figs. 6 and 17 in Appendix A). Two notable exceptions to this consistency were DCP and CAP, which delivered improvements of 12.0 and 7.2 p.p. on DeepFish with YOLOv8 but had minimal or negative influence when coupled with Faster R-CNN, and when applied to any other dataset, regardless of the detection algorithm. This highlights the need for using a combination of several underwater datasets and learning algorithms when evaluating image enhancement methods for downstream applications such as object detection.
Out of the global intensity adjustment methods, only auto-contrast improved detection accuracy consistently across the three datasets and the two detection algorithms. Auto-contrast performs linear rescaling of pixel intensities for each channel independently, which, apart from improving contrast, resulted in a reasonable colour correction for many of the underwater images (Fig. 13 in Appendix A). Judging based on a visual assessment of selected images, the low (often negative) \(\Delta \text {mAP}_{50}\) performance of the other global intensity adjustment methods was likely due to little improvement in contrast (gamma up, gamma down, and adjust log) or clipping the details in the dark and bright parts of the images (adjust sigmoid). Other than auto-contrast, methods including CLAHE, ACE, and grey world provided mostly positive improvements, but of lesser magnitudes than ARCR, MLLE, and MSRCR. While ARCR and MLLE, both designed specifically for visual restoration of underwater images, performed consistently well in our evaluation, the other underwater-specific method, a deep learning model FUnIE-GAN, had a mostly negative effect on detection accuracy, likely because it was trained on images of much smaller resolution (\(256 \times 256\) pixels) than the resolution used by Faster R-CNN and YOLOv8. Finally, MSRCR, the best-performing method, effectively improved object detection despite sometimes producing visual artifacts such as halo effects or low colour consistency (Fig. 18 in Appendix A).
Overall, the proposed framework identified and validated several image enhancement methods which can counteract domain shift in aquatic environments. MSRCR yielded the most accurate detections and was most consistent across the three datasets and the two algorithms. Figure 7 shows several example images for which the effect of MSRCR pre-processing resulted in an improved detection accuracy.
4.3 Analysis of the effect of image enhancement on domain-associated visual features
Plotting the first two principal components of raw and MSRCR-processed images revealed that MSRCR diminished the domain-associated visual features to some degree (Fig. 4). We quantified this effect across all image processing and enhancement methods using the negative mean silhouette score (NMSS). Strikingly, NMSS was correlated with mAP\(_{50}\) across the three datasets and the two detection algorithms (r of 0.43–0.86) with MSRCR yielding the highest NMSS values (Fig. 8). Increasing dimensionality of the PCA projections to five and ten principal components confirmed these results. This demonstrates that MSRCR and other image enhancement methods diminish domain-associated visual features, which is likely the reason for the observed detection accuracy improvements. The correlation between the NMSS measure and the detection accuracy under domain shift could be exploited for optimising or developing new underwater image enhancement methods.
In addition to NMSS, we used several established image quality measures to quantify the relationship between the visual features and detection accuracy. Only NMSS yielded a strong, positive correlation between image quality and mAP\(_{50}\) values for all three datasets (Fig. 9). NIQE, SSIM, and PSNR showed little or negative correlation with mAP\(_{50}\). UIQM, designed to correlate with human perception of underwater image quality, did not show a consistent correlation pattern. Decomposing UIQM into the three scores it consists of revealed that each dataset’s mAP\(_{50}\) values were correlated with a different aspect of image quality. These were sharpness, colourfulness, and contrast for DeepFish, MBEEC-Low-Vis, and Jellytoring, respectively. Overall, the domain generalization performance of the image processing and enhancement methods did not correlate with measures quantifying the perceived image quality, but with NMSS, which quantifies the presence of domain-associated visual features by assessing the separability of domains in a dimensionality-reduced space of principal components.
To demonstrate that the mAP\(_{50}\) improvements were largely due to image enhancement mitigating domain shift, the object detectors were subjected to the random splitting evaluation. None of the image processing and enhancement methods, including MSRCR (Fig. 10), yielded an improvement greater than 0.5 p.p. on any of the three datasets (Tables 6 and 7 and Figs. 15 and 16 in Appendix A). This discrepancy in \(\Delta \text {mAP}_{50}\) between the domain and random splitting evaluations further supports the notion that MSRCR and other image enhancement methods improve underwater domain generalization by reducing domain-associated visual features. That is, under the random splitting evaluation, in which all domains are included in the training set, reducing domain-associated visual features does not improve accuracy. This contrasts with the domain-based splitting evaluation, in which the reduction of domain-associated visual features lessens the domain shift between the training and test sets, leading to improved detection accuracy.
4.4 Evaluation of different MSRCR configurations
We evaluated several MSRCR configurations regarding sensitivity to the number of scales and the \(\sigma \) parameters. On the discovery (DeepFish) dataset using YOLOv8, all tested combinations resulted in mAP\(_{50}\) values greater than those obtained with raw images with no processing. The configuration with the best mAP\(_{50}\) performance was the default MSRCR with \(\sigma \) values of 15, 80, and 250 (Fig. 11). Using three scales performed better than using single or two scales, and using two scales performed better than a single scale as long as a scale extracting fine (local) details was included. The importance of the local details was highlighted by the relatively high detection accuracy achieved with a single-scale retinex when \(\sigma \) was set to 15. Inspection of a number of images showed that this configuration resulted in unnatural visual effects due to a high increase in locally limited contrast (Fig. 19 in Appendix A). We speculate that the local contrast aided in distinguishing object shapes, thereby improving detection accuracy, despite the low level of perceived visual quality. Conversely, configurations that did not include a scale for extracting local features resulted in soft-looking images and lower detection accuracy, with single-scale retinex \(\sigma =80\) and \(\sigma =250\) being the worst performing of the ten tested configurations.
The number of scales and the \(\sigma \) parameters were both associated with MSRCR’s running time. While large values of \(\sigma \) would generally result in increased running time due to the higher number of neighbouring pixels being processed at each position, the relationship with \(\sigma \) was not linear because our implementation resizes the inputs for the Gaussian blur filter when the kernel size \(k > 169\). Thus, configurations with \(k \gg 169\) had shorter running times than those with \(k < 169\) (Fig. 11). This effect was diminished when we employed the GPU-accelerated implementation, averaging 8, 9, and 10 ms for the single-, two-, and three-scale retinex configurations, respectively. The GPU implementation could be further improved by batching inputs, for example when processing videos from multiple cameras, with three-scale retinex achieving speeds of 8 ms per frame. In summary, MSRCR with its robust performance across different configurations and fast GPU-accelerated implementation could be used in a range of real-time underwater object detection applications.
4.5 Comparison of image enhancement with model-centric domain generalization methods
Evaluation of MSRCR image pre-processing using the S-UODAC dataset resulted in mAP\(_{50}\) values of 61.3% and 63.1% when coupled with YOLOv3 and Faster R-CNN, respectively, which represents respective improvements of 23.1 and 9.4 p.p. compared to raw images with no processing (Table 5). Relative to the best model-centric approach (DMCL), MSRCR improved the mAP\(_{50}\) values by 8.0 and 1.7 p.p. for YOLOv3 and Faster R-CNN, respectively. Figure 12 shows the effect of MSRCR pre-processing on an example image from the S-UODAC dataset. This result further demonstrates the ability of our framework to identify image enhancement methods that can robustly and consistently aid domain generalization in underwater object detection. Given the orthogonality of our data-centric and the existing model-centric approaches, combining the two could result in further improvements to detection accuracy.
5 Conclusions
Underwater environments are highly variable with domain shifts caused by variable turbidity, colour casts, and light conditions occurring rapidly even when models are deployed at a single geographical location. Therefore, it is imperative to develop and deploy models robust to domain shifts. Here we developed a data-centric framework for testing domain generalization in underwater object detection with a robust and comprehensive cross-validation approach. We used this framework to demonstrate that there is a large difference between within-domain and out-of-domain prediction performance of two widely used off-the-shelf object detectors. We hypothesised that the visual differences associated with the domain shift can be counteracted with an image pre-processing step. Applying the proposed framework to test 14 image processing and enhancement methods revealed that while some of the methods could aid domain generalization, others had negligible effects or even lowered the detection accuracy compared to using raw images with no processing. We proposed that a silhouette-inspired score (NMSS) could quantify the presence of domain-associated visual features and found a correlation between this score and out-of-domain detection accuracy. Thus, NMSS could be used in designing and optimising image enhancement methods for combating domain shift inherent to aquatic environments.
MSRCR was the best performing image enhancement method in our out-of-domain evaluation, consistently improving mAP\(_{50}\) by 3.2–13.4 p.p. across the three real-world aquatic datasets. This contrasts with previous studies which focused on applying image enhancement to the within-domain underwater object detection setting and found only a limited degree of improvement [59, 100]. We further validated our results in a comparison with existing work on an independent test set, which highlighted that detections with MSRCR-based image pre-processing were more accurate by 1.7–8.0 p.p. (mAP\(_{50}\)) than those of the model-centric domain generalization methods.
Because domain shift can occur rapidly in underwater environments, improving domain generalization has significant practical implications across several fields. In aquaculture, object detectors robust to domain shift can improve the accuracy of monitoring fish behaviour and welfare, supporting better fish health management [68]. In environmental and ecological monitoring, robust and reliable detection of a variety of species can assist conservation efforts [16]. Similarly, for underwater exploration with autonomous vehicles, robust detection methods can adapt to diverse underwater environments, enhancing navigation and data collection even with fluctuating light and water conditions [6].
There are several limitations in the scope of our current work. First, most of the 14 image processing and enhancement methods have parameters that could be optimised to achieve better performance, but for practical reasons we used default parameters for all methods. To mitigate the risk that default values give some methods an unfair advantage, we used three diverse datasets to confirm the robustness of our findings. Second, we used the publicly available weights of the FUnIE-GAN model, which was trained on images of considerably smaller resolution than that used by the two detection algorithms. This could have limited FUnIE-GAN's performance in our evaluation. Third, while we included both two-stage and single-stage object detection algorithms, both were based on CNN architectures. Validating our findings with a transformer-based detection algorithm would provide further evidence of the versatility of the proposed approach.
To conclude, the proposed framework allowed us to demonstrate the importance of out-of-domain evaluation in underwater object detection and the significant contribution of image enhancement to combating domain shift. Key ingredients of our framework, when applied to underwater object detection, are domain-splitting cross-validation, distinct discovery and replication datasets, and randomly seeded replicates. While here we considered 14 image enhancement methods, three datasets, and two object detection algorithms, the proposed framework provides guidelines for conducting other types of empirical evaluation studies in a robust and reproducible fashion. In future work, we plan to explore whether online optimisation of the MSRCR parameters [29] could further improve detection accuracy and to experiment with combining our data-centric approach with model-centric methods for domain generalization.
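Domain-splitting cross-validation can be implemented with a standard grouped splitter; the sketch below uses scikit-learn's GroupKFold with hypothetical video identifiers as the grouping variable, whereas the actual grouping in our experiments followed the dataset-specific domain definitions (video identity, date and location of recording, or habitat).

```python
from sklearn.model_selection import GroupKFold

def domain_splits(image_paths, domain_ids, n_splits=3):
    """Yield train/test splits such that all images from a given domain
    (e.g. one video, date, or habitat) fall entirely into one fold."""
    splitter = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in splitter.split(image_paths, groups=domain_ids):
        yield ([image_paths[i] for i in train_idx],
               [image_paths[i] for i in test_idx])

# Hypothetical usage with video file names as domain identifiers:
# for train_imgs, test_imgs in domain_splits(paths, video_ids):
#     train the detector on train_imgs, evaluate on test_imgs
```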
Data Availability
Only publicly available datasets were used in this research: DeepFish (https://github.com/alzayats/DeepFish) with the extended annotations (https://github.com/tamim662/YOLO-Fish), MBEEC-Low-Vis (https://github.com/slopezmarcano/dataset-fish-detection-low-visibility), Jellytoring (https://doi.org/10.5281/zenodo.6832131), S-UODAC (https://github.com/mousecpn/DMC-Domain-Generalization-for-Underwater-Object-Detection).
Materials availability
Not applicable.
Code availability
The code to reproduce the results presented in this article is available at https://github.com/lukas-folkman/enhance-to-generalize under the GNU Affero General Public License (AGPL) v3.0.
References
Aguirre-Castro OA, García-Guerrero EE, López-Bonilla OR et al (2022) Evaluation of underwater image enhancement algorithms based on Retinex and its implementation on embedded systems. Neurocomputing 494:148–159. https://doi.org/10.1016/j.neucom.2022.04.074
Akkaynak D, Treibitz T (2018) A revised underwater image formation model. In: 2018 IEEE/CVF conference on computer vision and pattern recognition. pp 6723–673. https://doi.org/10.1109/CVPR.2018.00703
Atlas WI, Ma S, Chou YC et al (2023) Wild salmon enumeration and monitoring using deep learning empowered detection and tracking. Front Mar Sci 1. https://doi.org/10.3389/fmars.2023.1200408
Balaji Y, Sankaranarayanan S, Chellappa R (2018) MetaReg: towards domain generalization using meta-regularization. In: Advances in neural information processing systems
Bradski G (2000) The OpenCV library. Dr Dobb’s journal of software tools
Cai L, McGuire NE, Hanlon R et al (2023) Semi-supervised visual tracking of marine animals using autonomous underwater vehicles. Int J Comput Vision 131(6):1406–1427. https://doi.org/10.1007/s11263-023-01762-5
Carlucci FM, D’Innocente A, Bucci S et al (2019) Domain generalization by solving jigsaw puzzles. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp 2224–2233. https://doi.org/10.1109/CVPR.2019.00233
Chaitin GJ (1966) On the length of programs for computing finite binary sequences. J ACM 13(4):547–56. https://doi.org/10.1145/321356.321363
Chen X, Lu Y, Wu Z et al (2021) Reveal of domain effect: how visual restoration contributes to object detection in aquatic scenes. In: Visual perception and control of underwater robots. CRC Press
Chen Y, Song P, Liu H et al (2023) Achieving domain generalization for underwater object detection by domain mixup and contrastive learning. Neurocomputing 528:20–3. https://doi.org/10.1016/j.neucom.2023.01.053
Cheng N, Xie H, Zhu X et al (2023) Joint image enhancement learning for marine object detection in natural scene. Eng Appl Artif Intell 120:10590. https://doi.org/10.1016/j.engappai.2023.105905
Chiang JY, Chen YC (2012) Underwater image enhancement by wavelength compensation and dehazing. IEEE Trans Image Process 21(4):1756–176. https://doi.org/10.1109/TIP.2011.2179666
Clark A et al (2023) Pillow (PIL Fork) documentation. https://pillow.readthedocs.io
Cong R, Yang W, Zhang W et al (2023) PUGAN: physical model-guided underwater image enhancement using GAN with dual-discriminators. IEEE Trans Image Process 32:4472–4485. https://doi.org/10.1109/TIP.2023.3286263
Connolly RM, Jinks KI, Herrera C et al (2022) Fish surveys on the move: Adapting automated fish detection and classification frameworks for videos on a remotely operated vehicle in shallow marine waters. Front Mar Sci 9:91850. https://doi.org/10.3389/fmars.2022.918504
Connolly RM, Herrera C, Rasmussen J et al (2024) Estimating enhanced fish production on restored shellfish reefs using automated data collection from underwater videos. J Appl Ecol 61(4):633–646. https://doi.org/10.1111/1365-2664.14617
Costello C, Cao L, Gelcich S et al (2020) The future of food from the sea. Nature 588(7836):95–10. https://doi.org/10.1038/s41586-020-2616-y
Dai L, Liu H, Song P et al (2024) A gated cross-domain collaborative network for underwater object detection. Pattern Recogn 149:110222. https://doi.org/10.1016/j.patcog.2023.110222
Ditria EM, Lopez-Marcano S, Sievers M et al (2020) Automating the analysis of fish abundance using object detection: optimizing animal ecology with deep learning. Front Mar Sci 7. https://doi.org/10.3389/fmars.2020.00429
Dou Q, Castro DC, Kamnitsas K et al (2019) Domain generalization via model-agnostic learning of semantic features. In: Proceedings of the 33rd international conference on neural information processing systems, vol. 579. p 6450–6461
Du D, Li E, Si L et al (2025) UIEDP: Boosting underwater image enhancement with diffusion prior. Expert Syst Appl 259:125271. https://doi.org/10.1016/j.eswa.2024.125271
Eger AM, Marzinelli EM, Beas-Luna R et al (2023) The value of ecosystem services in global marine kelp forests. Nat Commun 14(1):1894. https://doi.org/10.1038/s41467-023-37385-0
Fu C, Liu R, Fan X et al (2023) Rethinking general underwater object detection: datasets, challenges, and solutions. Neurocomputing 517:243–25. https://doi.org/10.1016/j.neucom.2022.10.039
Galdran A, Pardo D, Picón A et al (2015) Automatic Red-Channel underwater image restoration. J Vis Commun Image Represent 26:132–14. https://doi.org/10.1016/j.jvcir.2014.11.006
Ganin Y, Ustinova E, Ajakan H et al (2016) Domain-adversarial training of neural networks. J Mach Learn Res 17(1):2096–2030
Gao M, Li S, Wang K et al (2023) Real-time jellyfish classification and detection algorithm based on improved YOLOv4-tiny and improved underwater image enhancement algorithm. Sci Rep 13(1):12989. https://doi.org/10.1038/s41598-023-39851-7
Harary S, Schwartz E, Arbelle A et al (2022) Unsupervised domain generalization by learning a bridge across domains. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 5280–5290
He K, Sun J, Tang X (2009) Single image haze removal using dark channel prior. In: 2009 IEEE conference on computer vision and pattern recognition. pp 1956–196. https://doi.org/10.1109/CVPR.2009.5206515
Hu K, Zhang Y, Lu F et al (2020) An underwater image enhancement algorithm based on MSR parameter optimization. J Marine Sci Eng 8(10):74. https://doi.org/10.3390/jmse8100741
Huang H, Zhou H, Yang X et al (2019) Faster R-CNN for marine organisms detection and recognition using data augmentation. Neurocomputing 337:372–38. https://doi.org/10.1016/j.neucom.2019.01.084
Islam MJ, Xia Y, Sattar J (2020) Fast underwater image enhancement for improved visual perception. IEEE Robot Autom Lett 5(2):3227–323. https://doi.org/10.1109/LRA.2020.2974710
Jaffe J (1990) Computer modeling and the design of optimal underwater imaging systems. IEEE J Oceanic Eng 15(2):101–111. https://doi.org/10.1109/48.50695
Jia C, Zhang Y (2024) Meta-learning the invariant representation for domain generalization. Mach Learn 113(4):1661–1681. https://doi.org/10.1007/s10994-022-06256-y
Jiang L, Wang Y, Jia Q et al (2021) Underwater species detection using channel sharpening attention. In: Proceedings of the 29th ACM international conference on multimedia. Association for Computing Machinery. pp 4259–426. https://doi.org/10.1145/3474085.3475563
Jobson D, Rahman Z, Woodell G (1997) A multiscale retinex for bridging the gap between color images and the human observation of scenes. IEEE Trans Image Process 6(7):965–97. https://doi.org/10.1109/83.597272
Jocher G, Chaurasia A, Qiu J (2023) Ultralytics YOLO. https://github.com/ultralytics/ultralytics
Kabir H, Garg N (2023) Machine learning enabled orthogonal camera goniometry for accurate and robust contact angle measurements. Sci Rep 13(1):149. https://doi.org/10.1038/s41598-023-28763-1
Kang Y, Jiang Q, Li C et al (2023) A perception-aware decomposition and fusion framework for underwater image enhancement. IEEE Trans Circuits Syst Video Technol 33(3):988–1002. https://doi.org/10.1109/TCSVT.2022.3208100
Katija K, Roberts PLD, Daniels J, et al (2021) Visual tracking of deepwater animals using machine learning-controlled robotic underwater vehicles. In: 2021 IEEE Winter Conference on Applications of Computer Vision (WACV). pp 859–86. https://doi.org/10.1109/WACV48630.2021.00090
Katija K, Orenstein E, Schlining B et al (2022) FathomNet: a global image database for enabling artificial intelligence in the ocean. Sci Rep 12(1):1591. https://doi.org/10.1038/s41598-022-19939-2
Kolmogorov AN (1968) Three approaches to the quantitative definition of information. Int J Comput Math 2(1–4):157–168. https://doi.org/10.1080/00207166808803030
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pp 1097–1105
Land EH (1977) The retinex theory of color vision. Sci Am 237(6):108–12. https://doi.org/10.1038/scientificamerican1277-108
Lee W, Hong D, Lim H et al (2024) Object-aware domain generalization for object detection. arXiv:2312.12133 [cs]
Li C, Guo J, Guo C (2018) Emerging from water: underwater image color correction based on weakly supervised color transfer. IEEE Signal Process Lett 25(3):323–327. https://doi.org/10.1109/LSP.2018.2792050
Li C, Guo C, Ren W et al (2019) An underwater image enhancement benchmark dataset and beyond. IEEE Trans Image Process 29:4376–438. https://doi.org/10.1109/TIP.2019.2955241
Li C, Anwar S, Porikli F (2020) Underwater scene prior inspired deep underwater image and video enhancement. Pattern Recogn 98:107038. https://doi.org/10.1016/j.patcog.2019.107038
Li CY, Guo JC, Cong RM et al (2016) Underwater image enhancement by dehazing with minimum information loss and histogram distribution prior. IEEE Trans Image Process 25(12):5664–567. https://doi.org/10.1109/TIP.2016.2612882
Li D, Du L (2022) Recent advances of deep learning algorithms for aquacultural machine vision systems with emphasis on fish. Artif Intell Rev 55(5):4077–411. https://doi.org/10.1007/s10462-021-10102-3
Li H, Pan SJ, Wang S et al (2018) Domain generalization with adversarial feature learning. In: 2018 IEEE/CVF conference on computer vision and pattern recognition. pp 5400–5409. https://doi.org/10.1109/CVPR.2018.00566
Li J, Skinner KA, Eustice RM et al (2017) WaterGAN: unsupervised generative network to enable real-time color correction of monocular underwater images. IEEE Robot Autom Lett 1. https://doi.org/10.1109/LRA.2017.2730363
Li P, Li D, Li W et al (2021) A simple feature augmentation for domain generalization. In: Proceedings of the IEEE/CVF international conference on computer vision. pp 8886–8895
Li Y, Tian X, Gong M et al (2018) Deep domain generalization via conditional invariant adversarial networks. In: Proceedings of the European Conference on Computer Vision (ECCV). pp 624–639
Lin C, Yuan Z, Zhao S et al (2021) Domain-invariant disentangled network for generalizable object detection. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp 8751–8760. https://doi.org/10.1109/ICCV48922.2021.00865
Lin TY, Maire M, Belongie S et al (2014) Microsoft COCO: Common Objects in Context. In: Fleet D, Pajdla T, Schiele B, et al (eds) Computer vision - ECCV 2014, Lecture notes in computer science. pp 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
Liu C, Li H, Wang S et al (2021) A dataset and benchmark of underwater object detection for robot picking. In: 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). pp 1–6. https://doi.org/10.1109/ICMEW53276.2021.9455997
Liu H, Song P, Ding R (2020) Towards domain generalization in underwater object detection. In: 2020 IEEE International Conference on Image Processing (ICIP). pp 1971–1975. https://doi.org/10.1109/ICIP40778.2020.9191364
Liu P, Qian W, Wang Y (2024) YWnet: a convolutional block attention-based fusion deep learning method for complex underwater small target detection. Eco Inform 79:102401. https://doi.org/10.1016/j.ecoinf.2023.102401
Liu R, Fan X, Zhu M et al (2020) Real-world underwater enhancement: challenges, benchmarks, and solutions under natural light. IEEE Trans Circuits Syst Video Technol 30(12):4861–487. https://doi.org/10.1109/TCSVT.2019.2963772
Liu Z, Wang B, Li Y et al (2024) UnitModule: a lightweight joint image enhancement module for underwater object detection. Pattern Recogn 151:11043. https://doi.org/10.1016/j.patcog.2024.110435
Lopez-Marcano S, Jinks EL, Buelow CA et al (2021) Automatic detection of fish and tracking of movement for ecology. Ecol Evol 11(12):8254–826. https://doi.org/10.1002/ece3.7656
Lyu L, Liu Y, Xu X et al (2023) EFP-YOLO: a quantitative detection algorithm for marine benthic organisms. Ocean Coastal Manag 243:10677. https://doi.org/10.1016/j.ocecoaman.2023.106770
Ma H, Zhang Y, Sun S et al (2024) Weighted multi-error information entropy based you only look once network for underwater object detection. Eng Appl Artif Intell 130:107766. https://doi.org/10.1016/j.engappai.2023.107766
Mandal R, Connolly RM, Schlacher TA et al (2018) Assessing fish abundance from underwater video using deep neural networks. In: 2018 International Joint Conference on Neural Networks (IJCNN). pp 1–6. https://doi.org/10.1109/IJCNN.2018.8489482
Marrable D, Barker K, Tippaya S et al (2022) Accelerating species recognition and labelling of fish from underwater video with machine-assisted deep learning. Front Mar Sci 9. https://doi.org/10.3389/fmars.2022.944582
Meng R, Li X, Chen W et al (2022) Attention diversification for domain generalization. In: Avidan S, Brostow G, Cissé M, et al (eds) Computer Vision - ECCV 2022, Lecture notes in computer science. pp 322–34. https://doi.org/10.1007/978-3-031-19830-4_19
Mittal A, Soundararajan R, Bovik AC (2013) Making a “completely blind’’ image quality analyzer. IEEE Signal Process Lett 20(3):209–212. https://doi.org/10.1109/LSP.2012.2227726
Måløy H, Aamodt A, Misimi E (2019) A spatio-temporal recurrent network for salmon feeding action recognition from underwater videos in aquaculture. Comput Electron Agric 167:105087. https://doi.org/10.1016/j.compag.2019.105087
Motiian S, Piccirilli M, Adjeroh DA et al (2017) Unified deep supervised domain adaptation and generalization. In: 2017 IEEE International Conference on Computer Vision (ICCV). pp 5716–5726. https://doi.org/10.1109/ICCV.2017.609
Muksit AA, Hasan F, Hasan Bhuiyan Emon MF et al (2022) YOLO-Fish: a robust fish detection model to detect fish in realistic underwater environment. Eco Inform 72:101847. https://doi.org/10.1016/j.ecoinf.2022.101847
NVIDIA (2023) The NVIDIA Data Loading Library (DALI). https://github.com/NVIDIA/DALI
Ottaviani E, Francescangeli M, Gjeci N et al (2022) Assessing the image concept drift at the OBSEA coastal underwater cabled observatory. Front Mar Sci. https://doi.org/10.3389/fmars.2022.840088
Pal SK, Pramanik A, Maiti J et al (2021) Deep learning in multi-object detection and tracking: state of the art. Appl Intell 51(9):6400–642. https://doi.org/10.1007/s10489-021-02293-7
Panetta K, Gao C, Agaian S (2016) Human-visual-system-inspired underwater image quality measures. IEEE J Oceanic Eng 41(3):541–551. https://doi.org/10.1109/JOE.2015.2469915
Panetta K, Kezebou L, Oludare V et al (2022) Comprehensive underwater object tracking benchmark dataset and underwater image enhancement with GAN. IEEE J Oceanic Eng 47(1):59–75. https://doi.org/10.1109/JOE.2021.3086907
Peng L, Zhu C, Bian L (2023) U-shape transformer for underwater image enhancement. In: Karlinsky L, Michaeli T, Nishino K (eds) Computer Vision - ECCV 2022 workshops. Springer Nature Switzerland, Cham, pp 290–307. https://doi.org/10.1007/978-3-031-25063-7_18
Pizer SM, Amburn EP, Austin JD et al (1987) Adaptive histogram equalization and its variations. Comput Vis Graph Image Process 39(3):355–36. https://doi.org/10.1016/S0734-189X(87)80186-X
Pu H, Zhang D, Xu K et al (2024) BNN-SAM: improving generalization of binary object detector by seeking flat minima. Appl Intell. https://doi.org/10.1007/s10489-024-05512-z
Redmon J, Divvala S, Girshick R et al (2016) You only look once: unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp 779–788. https://doi.org/10.1109/CVPR.2016.91
Ren S, He K, Girshick R et al (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems
Rizzi A, Gatta C, Marini D (2003) A new algorithm for unsupervised global and local color correction. Pattern Recogn Lett 24(11):1663–1677. https://doi.org/10.1016/S0167-8655(02)00323-9
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–6. https://doi.org/10.1016/0377-0427(87)90125-7
Ruiz-Frau A, Martin-Abadal M, Jennings CL et al (2022) The potential of Jellytoring 2.0 smart tool as a global jellyfish monitoring platform. Ecol Evol 12(11):e947. https://doi.org/10.1002/ece3.9472
Saleh A, Laradji IH, Konovalov DA et al (2020) A realistic fish-habitat dataset to evaluate algorithms for underwater visual analysis. Sci Rep 10(1):1467. https://doi.org/10.1038/s41598-020-71639-x
Saleh A, Sheaves M, Jerry D et al (2024) Applications of deep learning in fish habitat monitoring: A tutorial and survey. Expert Syst Appl 238:12184. https://doi.org/10.1016/j.eswa.2023.121841
Schechner Y, Karpel N (2004) Clear underwater vision. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004. pp I–I. https://doi.org/10.1109/CVPR.2004.1315078
Schmitt R (2018) The Ocean’s role in climate. Oceanography 31(2):100. https://doi.org/10.5670/oceanog.2018.225
Shankar S, Piratla V, Chakrabarti S et al (2018) Generalizing across domains via cross-gradient training. In: International conference on learning representations
Shui C, Wang B, Gagné C (2022) On the benefits of representation regularization in invariance based domain generalization. Mach Learn 111(3):895–91. https://doi.org/10.1007/s10994-021-06080-w
Sicilia A, Zhao X, Hwang SJ (2023) Domain adversarial neural networks for domain generalization: when it works and how to improve. Mach Learn 112(7):2685–272. https://doi.org/10.1007/s10994-023-06324-x
Volpi R, Namkoong H, Sener O et al (2018) Generalizing to unseen domains via adversarial data augmentation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18. pp 5339–5349
van der Walt S, Schönberger JL, Nunez-Iglesias J et al (2014) scikit-image: image processing in Python. PeerJ 2:e453. https://doi.org/10.7717/peerj.453
Wang J, Lan C, Liu C et al (2023) Generalizing to unseen domains: a survey on domain generalization. IEEE Trans Knowl Data Eng 35(8):8052–8072. https://doi.org/10.1109/TKDE.2022.3178128
Wang P, Zhang Z, Lei Z et al (2023) Sharpness-aware gradient matching for domain generalization. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp 3769–3778. https://doi.org/10.1109/CVPR52729.2023.00367
Wang S, Yu L, Li C et al (2020) Learning from extrinsic and intrinsic supervisions for domain generalization. In: Vedaldi A, Bischof H, Brox T, et al (eds) Computer Vision - ECCV 2020, vol 12354. p 159–176. https://doi.org/10.1007/978-3-030-58545-7_10
Wang Z, Bovik A, Sheikh H et al (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–61. https://doi.org/10.1109/TIP.2003.819861
Wu A, Deng C (2022) Single-domain generalized object detection in urban scene via cyclic-disentangled self-distillation. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp 837–84. https://doi.org/10.1109/CVPR52688.2022.00092
Wu X, Zhang L, Huang J et al (2024) Underwater image enhancement via modeling white degradation. IEEE J Oceanic Eng 49(4):1220–123. https://doi.org/10.1109/JOE.2024.3429653
Wu Y, Kirillov A, Massa F et al (2019) Detectron2. https://github.com/facebookresearch/detectron2
Xu S, Zhang M, Song W et al (2023) A systematic review and analysis of deep learning-based underwater object detection. Neurocomputing 527:204–232. https://doi.org/10.1016/j.neucom.2023.01.056
Yang X, Zhang S, Liu J et al (2021) Deep learning for smart fish farming: applications, opportunities and challenges. Rev Aquac 13(1):66–9. https://doi.org/10.1111/raq.12464
Yeh CH, Lin CH, Kang LW et al (2022) Lightweight deep neural network for joint learning of underwater object detection and color conversion. IEEE Trans Neural Netw Learn Syst 33(11):6129–614. https://doi.org/10.1109/TNNLS.2021.3072414
Zhang J, Zhu L, Xu L et al (2020) Research on the correlation between image enhancement and underwater object detection. In: 2020 Chinese Automation Congress (CAC). pp 5928–5933. https://doi.org/10.1109/CAC51589.2020.9326936
Zhang J, Zhang J, Zhou K et al (2023) An improved YOLOv5-based underwater object-detection framework. Sensors 23(7):3693. https://doi.org/10.3390/s23073693
Zhang W, Zhuang P, Sun HH et al (2022) Underwater image enhancement via minimal color loss and locally adaptive contrast enhancement. IEEE Trans Image Process 31:3997–401. https://doi.org/10.1109/TIP.2022.3177129
Zhang W, Zhou L, Zhuang P et al (2024) Underwater image enhancement via weighted wavelet visual perception fusion. IEEE Trans Circuits Syst Video Technol 34(4):2469–2483. https://doi.org/10.1109/TCSVT.2023.3299314
Zhang X, Xu Z, Xu R et al (2022) Towards domain generalization in object detection. arXiv:2203.14387 [cs]
Zhou K, Yang Y, Hospedales T et al (2020) Learning to generate novel domains for domain generalization. In: Vedaldi A, Bischof H, Brox T, et al (eds) Computer Vision - ECCV 2020, vol 12361. p 561–578. https://doi.org/10.1007/978-3-030-58517-4_33
Zhou K, Liu Z, Qiao Y et al (2023) Domain generalization: a survey. IEEE Trans Pattern Anal Mach Intell 45(4):4396–4415. https://doi.org/10.1109/TPAMI.2022.3195549
Zhu Q, Mai J, Shao L (2015) A fast single image haze removal algorithm using color attenuation prior. IEEE Trans Image Process 24(11):3522–353. https://doi.org/10.1109/TIP.2015.2446191
Zion B (2012) The use of computer vision technologies in aquaculture - a review. Comput Electron Agric 88:125–132. https://doi.org/10.1016/j.compag.2012.07.010
Acknowledgements
We thank Dr César Herrera and Dr Sebastian Lopez-Marcano for the insightful discussions regarding suitable datasets and resources for this research. We gratefully acknowledge the support of the Griffith University eResearch Service & Specialised Platforms Team and the use of the High Performance Computing Cluster “Gowonda” to complete this research.
Funding
Open Access funding enabled and organized by CAUL and its Member Institutions. This research was funded by the Blue Economy Cooperative Research Centre, established and supported under the Australian Government’s Cooperative Research Centres Program, grant number CRC20180101.
Author information
Authors and Affiliations
Contributions
Lukas Folkman: Conceptualization, Methodology, Formal analysis and investigation, Data curation, Software, Writing - original draft preparation, Writing - review and editing. Kylie A. Pitt: Conceptualization, Writing - review and editing, Funding acquisition. Bela Stantic: Conceptualization, Writing - review and editing, Funding acquisition.
Corresponding author
Ethics declarations
Conflicts of Interest:
The authors have no competing interests to declare that are relevant to the content of this article.
Ethics approval and consent to participate:
Not applicable.
Consent for publication:
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.